Abstract. In this paper we study post-penalized estimators which apply ordinary, unpenalized
linear regression to the model selected by first-step penalized estimators, typically LASSO.
It is well known that LASSO can estimate the regression function at nearly the oracle rate, and
is thus hard to improve upon. We show that post-LASSO performs at least as well as LASSO
in terms of the rate of convergence, and has the advantage of a smaller bias. Remarkably, this
performance occurs even if the LASSO-based model selection "fails" in the sense of missing
some components of the "true" regression model. By the "true" model we mean here the best
s-dimensional approximation to the regression function chosen by the oracle. Furthermore,
post-LASSO can perform strictly better than LASSO, in the sense of a strictly faster rate
of convergence, if the LASSO-based model selection correctly includes all components of the
"true" model as a subset and also achieves a sufficient sparsity. In the extreme case, when
LASSO perfectly selects the "true" model, the post-LASSO estimator becomes the oracle
estimator. An important ingredient in our analysis is a new sparsity bound on the dimension of the
model selected by LASSO which guarantees that this dimension is at most of the same order
as the dimension of the "true" model. Our rate results are non-asymptotic and hold in both
parametric and nonparametric models. Moreover, our analysis is not limited to the LASSO
estimator in the first step, but also applies to other estimators, for example, the trimmed LASSO,
Dantzig selector, or any other estimator with good rates and good sparsity. Our analysis covers
both traditional trimming and a new practical, completely data-driven trimming scheme that
induces maximal sparsity subject to maintaining a certain goodness-of-fit. The latter scheme
has theoretical guarantees similar to those of LASSO or post-LASSO, but it dominates these
procedures as well as traditional trimming in a wide variety of experiments.

1. Introduction

In this work we study post-model selected estimators for linear regression in high-dimensional
sparse models (HDSMs). In such models, the overall number of regressors $p$ is very large,
possibly much larger than the sample size $n$. However, the number $s$ of significant regressors
(those having a non-zero impact on the response variable) is smaller than the sample size, that
is, $s = o(n)$. HDSMs ([6], [13]) have emerged to deal with many new applications arising in
biometrics,  signal  processing,  machine  learning,  econometrics,  and  other  areas  of  data  analysis 
where  high-dimensional  data  sets  have  become  widely  available. 

Several papers have begun to investigate estimation of HDSMs, primarily focusing on penalized
mean regression, with the $\ell_1$-norm acting as a penalty function [2, 6, 10, 13, 17, 20, 19].
[2, 6, 10, 13, 20, 19] demonstrated the fundamental result that $\ell_1$-penalized least squares
estimators achieve the rate $\sqrt{s/n}\sqrt{\log p}$, which is very close to the oracle rate
$\sqrt{s/n}$ achievable when the true model is known. [17] demonstrated a similar fundamental
result on the excess forecasting error loss under both quadratic and non-quadratic loss functions.
Thus the estimator can be consistent and can have excellent forecasting performance even under
very rapid, nearly exponential growth of the total number of regressors $p$. [1] investigated the
$\ell_1$-penalized quantile regression process, obtaining similar results. See [9, 2, 3, 4, 5, 11, 12, 15]
for many other interesting developments and a detailed review of the existing literature.

In this paper we derive theoretical properties of post-penalized estimators which apply ordinary,
unpenalized linear least squares regression to the model selected by first-step penalized
estimators, typically LASSO. It is well known that LASSO can estimate the mean regression
function at nearly the oracle rate, and hence is hard to improve upon. We show that post-LASSO
can perform at least as well as LASSO in terms of the rate of convergence, and has
the advantage of a smaller bias. This performance occurs even if the LASSO-based model
selection "fails" in the sense of missing some components of the "true" regression model. Here
by the "true" model we mean the best s-dimensional approximation to the regression function
chosen by the oracle. The intuition for this result is that LASSO-based model selection omits
only those components with relatively small coefficients. Furthermore, post-LASSO can perform
strictly better than LASSO, in the sense of a strictly faster rate of convergence, if the
LASSO-based model correctly includes all components of the "true" model as a subset and is
sufficiently sparse. Of course, in the extreme case, when LASSO perfectly selects the "true"
model, the post-LASSO estimator becomes the oracle estimator.

Importantly, our rate analysis is not limited to the LASSO estimator in the first step, but
applies  to  a  wide  variety  of  other  first-step  estimators,  including,  for  example,  trimmed  LASSO, 
the  Dantzig  selector,  and  their  various  modifications.  We  give  generic  rate  results  that  cover  any 
first-step  estimator  for  which  a  rate  and  a  sparsity  bound  are  available.  We  also  give  a  generic 
result  on  trimmed  first-step  estimators,  where  trimming  can  be  performed  by  a  traditional  hard- 
thresholding  scheme  or  by  a  new  trimming  scheme  we  introduce  in  the  paper.  Our  new  trimming 
scheme  induces  maximal  sparsity  subject  to  maintaining  a  certain  goodness-of-fit  (goof)  in  the 
sample,  and  is  completely  data-driven.  We  show  that  our  post-goof-trimmed  estimator  performs 
at  least  as  well  as  the  first-step  estimator;  for  example,  the  post-goof-trimmed  LASSO  performs 
at  least  as  well  as  LASSO,  but  can  be  strictly  better  under  good  model  selection  properties. 
It  should  also  be  noted  that  traditional  trimming  schemes  do  not  in  general  have  such  nice 
theoretical  guarantees,  even  in  simple  diagonal  models. 

Finally,  we  conduct  a  series  of  computational  experiments  and  find  that  the  results  confirm 
our  theoretical  findings.  In  particular,  we  find  that  the  post-goof-trimmed  LASSO  and  post- 
LASSO  emerge  clearly  as  the  best  and  second  best,  both  substantially  outperforming  LASSO 
and  the  post-traditional-trimmed  LASSO  estimators. 

To  the  best  of  our  knowledge,  our  paper  is  the  first  to  establish  the  aforementioned  rate 
results  on  post-LASSO  and  the  proposed  post-goof-trimmed  LASSO  in  the  mean  regression 
problem.  Our  analysis  builds  upon  the  ideas  in  [1],  who  established  the  properties  of  post- 
penalized procedures for the related, but different, problem of median regression. Our analysis
also  builds  on  the  fundamental  results  of  [2]  and  the  other  works  cited  above  that  established 
the  properties  of  the  first-step  LASSO-type  estimators.  An  important  ingredient  in  our  analysis 
is  a  new  sparsity  bound  on  the  dimension  of  the  model  selected  by  LASSO,  which  guarantees 
that  this  dimension  is  at  most  of  the  same  order  as  the  dimension  of  the  "true"  model.  This 
result  builds  on  some  inequalities  for  sparse  eigenvalues  and  reasoning  previously  given  in  [1]  in 
the  context  of  median  regression.  Our  sparsity  bounds  for  LASSO  improve  upon  the  analogous 
bounds  in  [2]  and  are  comparable  to  the  bounds  in  [20]  obtained  under  a  larger  penalty  level.  We 
also  rely  on  maximal  inequalities  in  [20]  to  provide  primitive  conditions  for  the  sharp  sparsity 
bounds  to  hold. 

We  organize  the  remainder  of  the  paper  as  follows.  In  Section  2,  we  review  some  benchmark 
results  of  [2]  for  LASSO,  albeit  with  a  slightly  improved  choice  of  penalty,  and  model  selection 
results  of  [11,  13,  21].  In  Section  3,  we  present  a  generic  rate  result  on  post-penalized  estimators. 
In  Section  4,  we  present  a  generic  rate  result  for  post-trimmed-estimators,  where  trimming  can 
be traditional or based on goodness-of-fit. In Section 5, we apply our generic results to the
post-LASSO  and  the  post-trimmed  LASSO  estimators.  In  Section  6  we  present  the  results  of 
our  computational  experiments. 

Notation. In what follows, all parameter values are indexed by the sample size $n$, but we
omit the index whenever this does not cause confusion. We use the notation $(a)_+ = \max\{a, 0\}$,
$a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. The $\ell_2$-norm is denoted by $\|\cdot\|$ and the
$\ell_0$-norm $\|\cdot\|_0$ denotes the number of non-zero components of a vector. Given a vector
$\delta \in \mathbb{R}^p$, and a set of indices $T \subset \{1, \ldots, p\}$, we denote by $\delta_T$ the vector in which
$\delta_{Tj} = \delta_j$ if $j \in T$, $\delta_{Tj} = 0$ if $j \notin T$. We also use standard notation in the empirical
process literature, $E_n[f] = E_n[f(z_i)] = \sum_{i=1}^n f(z_i)/n$, and
$G_n(f) = \sum_{i=1}^n (f(z_i) - E[f(z_i)])/\sqrt{n}$. We use the notation $a \lesssim b$ to denote
$a \le cb$ for some constant $c > 0$ that does not depend on $n$; and $a \lesssim_P b$ to denote $a = O_P(b)$.
For an event $E$, we say that $E$ wp $\to 1$ when $E$ occurs with probability approaching one as $n$
grows.

2. LASSO as a Benchmark in Parametric and Nonparametric Models

The  purpose  of  this  section  is  to  define  the  models  for  which  we  state  our  results  and  also  to 
revisit  some  known  results  for  the  LASSO  estimator,  which  we  will  use  as  a  benchmark  and  as 
inputs  to  subsequent  proofs.  In  particular,  we  revisit  the  fundamental  rate  results  of  [2],  but 
with  a  slightly  improved,  data-driven  penalty  level. 

2.1. Model 1: Parametric Model. Let us consider the following parametric linear regression
model:

$$y_i = x_i'\beta_0 + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad \beta_0 \in \mathbb{R}^p, \quad i = 1, \ldots, n,$$

$$T = \mathrm{support}(\beta_0) \text{ has } s \text{ elements where } s < n, \text{ but } p > n,$$

where $T$ is unknown, and regressors $X = [x_1, \ldots, x_n]'$ are fixed and normalized so that
$\sigma_j^2 = E_n[x_{ij}^2] = 1$ for all $j = 1, \ldots, p$.

Given  the  large  number  of  regressors  p  >  n,  some  regularization  is  required  in  order  to 
avoid  overfitting  the  data.  The  LASSO  estimator  [16]  is  one  way  to  achieve  this  regularization. 
Specifically,  define 

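$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_1, \quad \text{where } \hat Q(\beta) = E_n[(y_i - x_i'\beta)^2]. \qquad (2.1)$$

Our goal is to revisit convergence results for $\hat\beta$ in the prediction (pseudo) norm,

$$\|\delta\|_{2,n} = \sqrt{E_n[(x_i'\delta)^2]}.$$
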
The key quantity in the analysis is the gradient at the true value:

$$S = 2E_n[x_i \epsilon_i].$$

This gradient is the effective "noise" in the problem. Indeed, for $\delta = \beta - \beta_0$, we have by the
Hölder inequality

$$\hat Q(\beta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2 = -2E_n[\epsilon_i x_i'\delta] \ge -\|S\|_\infty \|\delta\|_1. \qquad (2.2)$$

Thus, $\hat Q(\beta) - \hat Q(\beta_0)$ provides noisy information about $\|\delta\|_{2,n}^2$, and the amount of noise is
controlled by $\|S\|_\infty \|\delta\|_1$. This noise should be dominated by the penalty, so that the rate of
convergence can be deduced from a relationship between the penalty term and the quadratic
term $\|\delta\|_{2,n}^2$.

This reasoning suggests choosing $\lambda$ so that

$$\lambda \ge cn\|S\|_\infty, \quad \text{for some fixed } c > 1.$$

However, this choice is not feasible, since we do not know $S$. We propose setting

$$\lambda = c \cdot \Lambda(1 - \alpha | X), \qquad (2.3)$$

where $\Lambda(1 - \alpha | X)$ is the $(1 - \alpha)$-quantile of $n\|S\|_\infty$, so that for this choice

$$\lambda \ge cn\|S\|_\infty \text{ with probability at least } 1 - \alpha. \qquad (2.4)$$

Note that the quantity $\Lambda(1 - \alpha | X)$ is easily computed by simulation. We refer to this choice of
$\lambda$ as the data-driven choice, reflecting the dependence of the choice on the design matrix $X$.
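
The simulation of the penalty quantile can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalize_columns` and `simulate_penalty_quantile`, the number of Monte Carlo draws, the seed, and the small Gaussian design are all hypothetical choices, and $\sigma$ is taken as known. The code draws $\epsilon_i \sim N(0, \sigma^2)$ repeatedly with $X$ held fixed, evaluates $n\|S\|_\infty$ with $S = 2E_n[x_i\epsilon_i]$, and returns the empirical $(1-\alpha)$-quantile.

```python
import math
import random


def normalize_columns(X):
    """Rescale each column so that E_n[x_ij^2] = 1, as assumed in the text."""
    n, p = len(X), len(X[0])
    scale = [math.sqrt(sum(X[i][j] ** 2 for i in range(n)) / n) for j in range(p)]
    return [[X[i][j] / scale[j] for j in range(p)] for i in range(n)]


def simulate_penalty_quantile(X, sigma, alpha, n_draws=500, seed=0):
    """Monte Carlo estimate of Lambda(1 - alpha | X): the (1 - alpha)-quantile
    of n * ||S||_inf, where S = 2 * E_n[x_i * eps_i] and eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    draws = []
    for _ in range(n_draws):
        eps = [rng.gauss(0.0, sigma) for _ in range(n)]
        # n * S_j = 2 * sum_i x_ij * eps_i; keep the largest absolute value over j.
        draws.append(max(abs(2.0 * sum(X[i][j] * eps[i] for i in range(n)))
                         for j in range(p)))
    draws.sort()
    # Rank of the empirical (1 - alpha)-quantile among the sorted draws.
    k = max(1, min(n_draws, math.ceil((1.0 - alpha) * n_draws)))
    return draws[k - 1]


# Hypothetical small design, for illustration only.
rng = random.Random(1)
X = normalize_columns([[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(30)])
c = 1.1  # any fixed c > 1
lam = c * simulate_penalty_quantile(X, sigma=1.0, alpha=0.05, n_draws=200)
print("data-driven penalty level:", lam)
```

Because the draws condition on the realized $X$, correlated columns lower the simulated quantile, which is the effect discussed in Comment 2.1.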

Comment 2.1 (Data-driven choice vs standard choice). The standard choice of $\lambda$ employs

$$\lambda = c \cdot \sigma A \sqrt{2n \log p}, \qquad (2.5)$$

where $A > 1$ is a constant that does not depend on $X$, chosen so that (2.4) holds no matter what
$X$ is. Note that $\sqrt{n}\|S\|_\infty$ is a maximum of $N(0, \sigma^2)$ variables, which are correlated if columns of
$X$ are correlated, as they typically are in the sample. In order to compute $A$, the standard choice
uses the conservative assumption that these variables are uncorrelated. When the variables are
highly correlated, the standard choice (2.5) becomes quite conservative and may be too large.
The $X$-dependent choice of penalty (2.3) takes advantage of the in-sample correlations induced
by the design matrix and yields smaller penalties. To illustrate this point, we simulated many
designs $X$ by drawing $\tilde x_i$ as i.i.d. from $N(0, \Sigma)$, and defining $x_{ij} = \tilde x_{ij}/\sqrt{E_n[\tilde x_{ij}^2]}$, with $\Sigma_{jj} = 1$,
and varying correlations $\Sigma_{jk}$ for $j \neq k$ among three design options: $0$, $\rho^{|j-k|}$, or $\rho$. We then
computed $X$-dependent penalty levels (2.3). Figure 1 plots the sorted realized values of the
$X$-dependent $\lambda$ and illustrates the impact of in-sample correlation on these values. As expected, for
a fixed confidence level $1 - \alpha$, the more correlated the regressors are, the smaller the data-driven
penalty (2.3) is relative to the standard conservative choice (2.5).


[Figure 1: penalty level plotted against the quantile. Legend: Standard Bound; Data-dependent $\lambda$, Design 1; Data-dependent $\lambda$, Design 2; Data-dependent $\lambda$, Design 3.]

Figure 1. Realized values of $\Lambda(0.95|X)$ sorted in increasing order. $X$ is drawn by generating
$\tilde x_i$ as i.i.d. $N(0, \Sigma)$, where for $j \neq k$ design 1 has $\Sigma_{jk} = 0$, design 2 has $\Sigma_{jk} = (1/2)^{|j-k|}$, and
design 3 has $\Sigma_{jk} = 1/2$. We used $n = 100$, $p = 500$ and $\sigma^2 = 1$. For each design 100 design
matrices were drawn.


Under (2.3), $\delta = \hat\beta - \beta_0$ will obey the following "restricted condition" with probability at
least $1 - \alpha$:

$$\|\delta_{T^c}\|_1 \le \bar c\,\|\delta_T\|_1, \quad \text{where } \bar c := \frac{c+1}{c-1}. \qquad (2.6)$$
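
For the parametric model, the restricted condition follows from the optimality of the LASSO estimator in (2.1) combined with (2.2). The standard argument is sketched here for completeness; it is not quoted from the paper. It holds on the event $\lambda \ge cn\|S\|_\infty$, writing $\delta = \hat\beta - \beta_0$ and using that $\beta_0$ is supported on $T$, so that $\|\hat\beta\|_1 \ge \|\beta_0\|_1 - \|\delta_T\|_1 + \|\delta_{T^c}\|_1$:

```latex
\begin{align*}
\hat Q(\hat\beta) - \hat Q(\beta_0)
  &\le \frac{\lambda}{n}\bigl(\|\beta_0\|_1 - \|\hat\beta\|_1\bigr)
   \le \frac{\lambda}{n}\bigl(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\bigr), \\
\hat Q(\hat\beta) - \hat Q(\beta_0)
  &\ge \|\delta\|_{2,n}^2 - \|S\|_\infty \|\delta\|_1
   \ge -\frac{\lambda}{cn}\bigl(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\bigr), \\
\Longrightarrow\quad
\Bigl(1 - \frac{1}{c}\Bigr)\|\delta_{T^c}\|_1
  &\le \Bigl(1 + \frac{1}{c}\Bigr)\|\delta_T\|_1
\quad\Longleftrightarrow\quad
\|\delta_{T^c}\|_1 \le \frac{c+1}{c-1}\,\|\delta_T\|_1 .
\end{align*}
```

The requirement $c > 1$ is what keeps the factor $1 - 1/c$ positive in the last step.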


Therefore, in order to get convergence rates in the prediction norm $\|\delta\|_{2,n} = \sqrt{E_n[(x_i'\delta)^2]}$, we
consider the following modulus of continuity between the penalty and the prediction norm:

$$\text{RE.1}(\bar c) \qquad \kappa_1(T) := \min_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta_T \neq 0} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\delta_T\|_1},$$

where $\kappa_1(T)$ can depend on $n$. In turn, the convergence rate in the usual Euclidean norm $\|\delta\|$
is determined by the following modulus of continuity between the prediction norm and the
Euclidean norm:

$$\text{RE.2}(\bar c) \qquad \kappa_2(T) := \min_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta \neq 0} \frac{\|\delta\|_{2,n}}{\|\delta\|},$$

where $\kappa_2(T)$ can depend on $n$. Conditions RE.1 and RE.2 are simply variants of the original
restricted eigenvalue conditions imposed in Bickel, Ritov and Tsybakov [2]. In what follows, we
suppress dependence on $T$ whenever convenient.

Lemma  1  below  states  the  rate  of  convergence  in  the  prediction  norm  under  a  data-driven 
choice  of  penalty. 

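Lemma 1 (Essentially in Bickel, Ritov, and Tsybakov [2]). If $\lambda \ge cn\|S\|_\infty$, then

$$\|\hat\beta - \beta_0\|_{2,n} \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n\kappa_1}.$$

Under the data-driven choice (2.3), we have with probability at least $1 - \alpha$

$$\|\hat\beta - \beta_0\|_{2,n} \le (1 + c)\frac{\sqrt{s}}{n\kappa_1}\Lambda(1 - \alpha|X),$$

where $\Lambda(1 - \alpha|X) \le \sigma\sqrt{2n\log(p/\alpha)}$.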

Thus, provided $\kappa_1$ is bounded away from zero, LASSO estimates the regression function at
nearly the rate $\sqrt{s/n}$ (achievable when the true model $T$ is known) with probability at least
$1 - \alpha$. Since $\delta = \hat\beta - \beta_0$ obeys the restricted condition with probability at least $1 - \alpha$, the rate
in the Euclidean norm immediately follows from the relation

$$\|\hat\beta - \beta_0\| \le \|\hat\beta - \beta_0\|_{2,n}/\kappa_2, \qquad (2.7)$$

which also holds with probability at least $1 - \alpha$. Thus, if $\kappa_2$ is also bounded away from zero,
LASSO estimates the regression coefficients at a near $\sqrt{s/n}$ rate with probability at least $1 - \alpha$.
Note that the $\sqrt{s/n}$ rate is not the oracle rate in general, but under some further conditions
stated in Section 2.3, namely when the parametric model is the oracle model, this rate is an
oracle rate.

2.2. Model 2: Nonparametric model. Next we consider the nonparametric model given by

$$y_i = f(z_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n,$$

where $y_i$ are the outcomes, $z_i$ are vectors of fixed regressors, and $\epsilon_i$ are disturbances. For
$x_i = p(z_i)$, where $p(z_i)$ is a $p$-vector of transformations of $z_i$ and any conformable vector $\beta_0$,
and $f_i = f(z_i)$, we can rewrite

$$y_i = x_i'\beta_0 + u_i, \quad u_i = r_i + \epsilon_i, \quad \text{where } r_i := f_i - x_i'\beta_0.$$

Next we choose our target or "true" $\beta_0$, with the corresponding cardinality of its support
$s = \|\beta_0\|_0 = |T|$, as any solution to the following "oracle" risk minimization problem:

$$\min_{0 \le k \le p \wedge n}\ \min_{\|\beta\|_0 \le k} E_n[(f_i - x_i'\beta)^2] + \sigma^2\frac{k}{n}. \qquad (2.8)$$

Letting

$$c_s^2 := E_n[r_i^2] = E_n[(x_i'\beta_0 - f_i)^2]$$

denote the error from approximating $f_i$ by $x_i'\beta_0$, then $c_s^2 + \sigma^2 s/n$ is the optimal value of (2.8).
In order to simplify exposition, we focus some results and discussions on the case where the
following holds:

$$c_s^2 \le K\sigma^2 s/n \qquad (2.9)$$

with $K = 1$, which covers most cases of interest. Alternatively, we could consider an arbitrary
$K$, which does not affect the results modulo constants.

Note that $c_s^2 + \sigma^2 s/n$ is the expected estimation error $E[E_n[(f_i - x_i'\beta^o)^2]]$ of the (infeasible)
oracle estimator $\beta^o$ that minimizes the expected estimation error among all $k$-sparse least squares
estimators, by searching for the best $k$-dimensional model and then choosing $k$ to balance
approximation error with the sampling error, which the oracle knows how to compute. The
rate of convergence of the oracle estimator $\sqrt{c_s^2 + \sigma^2 s/n}$ becomes an ideal goal for the rate
of convergence, and in general can be achieved only up to logarithmic terms in most cases
(see Donoho and Johnstone [7] and Rigollet and Tsybakov [14]), except under very special
circumstances, such as when it becomes possible to perform perfect model selection. Finally,
note that when the approximation error, $c_s$, is zero the oracle model becomes the parametric
model of the previous section where we had $r_i = 0$.
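
For very small $p$, the oracle problem (2.8) can be solved by exhaustive search, which makes the trade-off between approximation error and sampling error concrete. The sketch below is an illustration only: the helper names `solve`, `approx_error` and `oracle_model`, the simulated design, and the choice of $f$ are assumptions, and the search is feasible only because the oracle is given $f$ and $\sigma$, which are unknown in practice.

```python
import itertools
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [A[r][:] + [b[r]] for r in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            factor = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * m
    for c in reversed(range(m)):
        x[c] = (M[c][m] - sum(M[c][k] * x[k] for k in range(c + 1, m))) / M[c][c]
    return x


def approx_error(X, f, subset):
    """Minimum of E_n[(f_i - x_i'beta)^2] over beta supported on `subset`."""
    n = len(f)
    if not subset:
        return sum(v * v for v in f) / n
    G = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in subset] for a in subset]
    g = [sum(X[i][a] * f[i] for i in range(n)) for a in subset]
    coef = solve(G, g)
    return sum((f[i] - sum(X[i][a] * cf for a, cf in zip(subset, coef))) ** 2
               for i in range(n)) / n


def oracle_model(X, f, sigma):
    """Brute-force solution of (2.8); returns (oracle risk, support, c_s^2)."""
    n, p = len(X), len(X[0])
    best = None
    for k in range(min(p, n) + 1):
        for subset in itertools.combinations(range(p), k):
            err = approx_error(X, f, subset)
            value = err + sigma ** 2 * k / n  # approximation error + sampling error
            if best is None or value < best[0]:
                best = (value, subset, err)
    return best


# Hypothetical example in which f is exactly 2-sparse in the dictionary.
rng = random.Random(0)
n, p, sigma = 40, 5, 1.0
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
f = [3.0 * xi[0] - 2.0 * xi[1] for xi in X]
value, T, cs2 = oracle_model(X, f, sigma)
print("oracle support:", T, "c_s^2:", cs2, "oracle risk:", value)
```

In this example the approximation error vanishes on the selected support, so the setting reduces to the parametric model with $r_i = 0$, as noted above.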

Next  we  state  a  rate  of  convergence  in  the  prediction  norm  under  the  data-driven  choice  of 
penalty. 

Lemma 2 (Essentially in Bickel, Ritov, and Tsybakov [2]). If $\lambda \ge cn\|S\|_\infty$, then

$$\|\hat\beta - \beta_0\|_{2,n} \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n\kappa_1} + 2c_s.$$

Under the data-driven choice (2.3), we have with probability at least $1 - \alpha$

$$\|\hat\beta - \beta_0\|_{2,n} \le (1 + c)\frac{\sqrt{s}}{n\kappa_1}\Lambda(1 - \alpha|X) + 2c_s,$$

where $\Lambda(1 - \alpha|X) \le \sigma\sqrt{2n\log(p/\alpha)}$.

Thus, provided $\kappa_1$ is bounded away from zero, LASSO estimates the regression function at
a near-oracle rate with probability at least $1 - \alpha$. Furthermore, the bound on empirical risk
follows from the triangle inequality:

$$\sqrt{E_n[(x_i'\hat\beta - f_i)^2]} \le \|\hat\beta - \beta_0\|_{2,n} + c_s. \qquad (2.10)$$

2.3.  Model  Selection  Properties.  The  primary  results  we  develop  do  not  require  the  first- 
step estimators like LASSO to perfectly select the true model. In fact, we are specifically
interested  in  the  most  common  cases  where  these  estimators  do  not  perfectly  select  the  true 
model.  For  these  cases,  we  will  prove  that  post-model  selection  estimators  such  as  post-LASSO 
achieve  near-oracle  rates  like  those  of  LASSO.  However,  in  some  special  cases,  where  perfect 
model  selection  is  possible,  these  estimators  can  achieve  the  exact  oracle  rates,  and  thus  can 
be  even  better  than  LASSO.  The  purpose  of  this  section  is  to  describe  these  very  special  cases 
where  perfect  model  selection  is  possible. 

In  the  discussion  of  our  results  on  post-penalized  estimators  we  will  refer  to  the  following 
model  selection  results  for  the  parametric  case. 

Lemma 3 (Essentially in Meinshausen and Yu [13] and Lounici [11]). 1) In the parametric
model, if the coefficients are well separated from zero, that is

$$\min_{j \in T}|\beta_{0j}| > \zeta + t, \quad \text{for } t \ge \zeta := \max_{j = 1, \ldots, p}|\hat\beta_j - \beta_{0j}|,$$

then the true model is a subset of the selected model, $T := \mathrm{support}(\beta_0) \subseteq \hat T := \mathrm{support}(\hat\beta)$.
Moreover $T$ can be perfectly selected by applying trimming of level $t$ to $\hat\beta$:

$$T = \hat T(t) := \{j \in \{1, \ldots, p\} : |\hat\beta_j| > t\}.$$

2) In particular, if $\lambda \ge cn\|S\|_\infty$, then

$$\zeta \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n\kappa_1\kappa_2}.$$

3) In particular, if $\lambda \ge cn\|S\|_\infty$, and there is a $u > 1$ such that the design matrix satisfies
$|E_n[x_{ij}x_{ik}]| \le 1/(u(1 + 2\bar c)s)$ for all $1 \le j < k \le p$, then

.       ^  (  2 


Thus,  we  see  from  parts  1)  and  2),  which  follow  from  [13]  and  Lemma  2,  that  perfect  model 
selection  is  possible  under  strong  assumptions  on  the  coefficients'  separation  away  from  zero. 
We also see from part 3), which is due to [11], that the strong separation of coefficients can be
considerably weakened in exchange for a strong assumption on the design matrix. Finally, the
following  extreme  result  also  requires  strong  assumptions  on  separation  of  coefficients  and  the 
design  matrix. 

Lemma 4 (Essentially in Zhao and Yu [21]). In the parametric model, under more restrictive
conditions on the design, separation of non-zero coefficients, and penalty parameter, specified
in [21], with a high probability

$$T = \mathrm{support}(\beta_0) = \hat T = \mathrm{support}(\hat\beta).$$
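
The trimming step in part 1) of Lemma 3 can be illustrated with a short sketch. The helper names `support` and `trim` and all coefficient values below are hypothetical, chosen only so that the separation condition of the lemma holds; in practice $\zeta$ is unknown and must be bounded as in parts 2) and 3).

```python
def support(beta):
    """Indices of the non-zero components of beta."""
    return {j for j, b in enumerate(beta) if b != 0.0}


def trim(beta, t):
    """Hard-thresholding at level t: keep the indices j with |beta_j| > t."""
    return {j for j, b in enumerate(beta) if abs(b) > t}


# Hypothetical true coefficients and a hypothetical first-step estimate.
beta0 = [2.0, -1.5, 0.0, 0.0, 0.0]
beta_hat = [1.7, -1.2, 0.2, 0.0, -0.1]

T = support(beta0)
zeta = max(abs(bh - b0) for bh, b0 in zip(beta_hat, beta0))  # sup-norm error
t = 0.5                                                      # trimming level, t >= zeta

# Separation condition of Lemma 3, part 1).
separated = t >= zeta and min(abs(beta0[j]) for j in T) > zeta + t

print("condition holds:", separated)
print("T is a subset of the selected model:", T <= support(beta_hat))
print("trimming recovers T:", trim(beta_hat, t) == T)
```

The first-step support here contains two wrong regressors, and trimming at level $t$ removes exactly those because their estimated coefficients are no larger than $\zeta \le t$.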

Comment  2.2.  We  only  review  model  selection  in  the  parametric  case.  There  are  two  reasons 
for  this:  first,  the  results  stated  above  have  been  developed  for  the  parametric  case  only,  and 
extending  them  to  nonparametric  cases  is  outside  the  main  focus  of  this  paper.  Second,  it 
is  clear  from  the  stated  conditions  that  in  the  nonparametric  context,  in  order  to  select  the 
oracle model $T$ perfectly, the oracle models have to be either (a) parametric (i.e. $c_s = 0$) or (b)
very close to parametric (with $c_s^2$ much smaller than $\sigma^2 s/n$) and satisfy other strong conditions
similar  to  those  stated  above.  Since  we  only  argue  that  post-LASSO  and  related  estimators  are 
as  good  as  LASSO  and  can  be  strictly  better  only  in  some  cases,  it  suffices  to  demonstrate  the 
latter  for  case  (a).  Moreover,  if  oracle  performance  is  achieved  for  case  (a),  then  by  continuity 
of  empirical  risk  with  respect  to  the  underlying  model,  the  oracle  performance  should  extend 
to  a  neighborhood  of  case  (a),  which  is  case  (b). 

3.  A  Generic  Result  on  Post-Model  Selection  Estimators 

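Let $\hat\beta$ be any first-step estimator acting as a model selection device and denote by

$$\hat T := \mathrm{support}(\hat\beta)$$

the model selected by this estimator; we assume $|\hat T| \le n$ throughout. Define the post-model
selection estimator as

$$\tilde\beta = \arg\min_{\beta \in \mathbb{R}^p:\ \beta_{\hat T^c} = 0} \hat Q(\beta). \qquad (3.11)$$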

If  model  selection  works  perfectly  (as  it  will  under  some  rather  stringent  conditions),  then  this 
estimator  is  simply  the  oracle  estimator  and  its  properties  are  well  known.  However,  of  more 
interest  is  the  case  when  model  selection  does  not  work  perfectly,  as  occurs  for  many  designs 
of  interest  in  applications.  In  this  section  we  derive  a  generic  result  on  the  performance  of  any 
post-model  selection  estimator. 
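
The two-step procedure can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' code: the LASSO step uses plain coordinate descent on the objective in (2.1), the penalty level is an illustrative choice rather than the simulated quantile in (2.3), and the function names and simulated data are hypothetical.

```python
import math
import random


def soft_threshold(z, g):
    """Soft-thresholding operator: sign(z) * max(|z| - g, 0)."""
    return z - g if z > g else z + g if z < -g else 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for min_b E_n[(y_i - x_i'b)^2] + (lam / n) * ||b||_1."""
    n, p = len(X), len(X[0])
    beta, resid = [0.0] * p, list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the partial residual (coordinate j added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            step = new - beta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * step
                beta[j] = new
    return beta


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [A[r][:] + [b[r]] for r in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            factor = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * m
    for c in reversed(range(m)):
        x[c] = (M[c][m] - sum(M[c][k] * x[k] for k in range(c + 1, m))) / M[c][c]
    return x


def post_lasso(X, y, beta_hat):
    """Least squares of y on the columns selected by beta_hat, zeros elsewhere."""
    n, p = len(X), len(X[0])
    sel = [j for j in range(p) if beta_hat[j] != 0.0]
    out = [0.0] * p
    if sel:
        G = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in sel] for a in sel]
        g = [sum(X[i][a] * y[i] for i in range(n)) for a in sel]
        for j, v in zip(sel, solve(G, g)):
            out[j] = v
    return out


def risk(X, y, beta):
    """Empirical risk Q(beta) = E_n[(y_i - x_i'beta)^2]."""
    return sum((yi - sum(x * b for x, b in zip(xi, beta))) ** 2
               for xi, yi in zip(X, y)) / len(y)


# Hypothetical simulated data: n = 80, p = 20, s = 3.
rng = random.Random(0)
n, p, sigma = 80, 20, 0.5
beta0 = [2.0, -2.0, 1.5] + [0.0] * (p - 3)
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(xi, beta0)) + rng.gauss(0.0, sigma) for xi in X]
lam = 1.1 * 2.0 * sigma * math.sqrt(2.0 * n * math.log(p))  # illustrative penalty level
beta_hat = lasso_cd(X, y, lam)
beta_tilde = post_lasso(X, y, beta_hat)
print("Q(LASSO) =", risk(X, y, beta_hat), "Q(post-LASSO) =", risk(X, y, beta_tilde))
```

By construction the second step cannot increase the in-sample risk, since the first-step estimate is itself feasible for the restricted least squares problem. The sketch solves the normal equations directly, which presumes the selected columns are linearly independent.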
In order to derive rates, we need the following minimal restricted sparse eigenvalue

$$\text{RSE.1}(m) \qquad \kappa(m)^2 := \min_{\|\delta_{T^c}\|_0 \le m,\ \delta \neq 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2},$$

as well as the following maximal restricted sparse eigenvalue

$$\text{RSE.2}(m) \qquad \phi(m) := \max_{\|\delta_{T^c}\|_0 \le m,\ \delta \neq 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2},$$

where $m$ is the restriction on the number of non-zero components outside the support $T$. It
will be convenient to define the following condition number associated with the sample design
matrix:

$$\mu_m := \frac{\sqrt{\phi(m)}}{\kappa(m)}. \qquad (3.12)$$


The  following  theorem  establishes  bounds  on  the  prediction  error  of  a  generic  second-step 
estimator. 

Theorem 1 (Performance of a generic second-step estimator). In either the parametric model
or the nonparametric model, let $\hat\beta$ be any first-step estimator with support $\hat T$, define

$$B_n := \hat Q(\hat\beta) - \hat Q(\beta_0) \quad \text{and} \quad C_n := \hat Q(\beta_{0\hat T}) - \hat Q(\beta_0),$$

and let $\tilde\beta$ be the second-step estimator. For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent of $n$
such that with probability at least $1 - \epsilon$, we have that for $\hat m := |\hat T \setminus T|$

$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\epsilon \sigma \sqrt{\frac{\hat m \log p + (\hat m + s)\log \mu_{\hat m}}{n}} + 2c_s + \sqrt{(B_n)_+ \wedge (C_n)_+},$$

where $c_s = 0$ in the parametric model. Furthermore, $B_n$ and $C_n$ obey bounds (3.13) stated
below.


The following lemma bounds $B_n$ and $C_n$, although in many cases we can bound $B_n$ by other
means, as we shall do in the LASSO case.

Lemma 5 (Generic control of $B_n$ and $C_n$). Let $\hat m = |\hat T \setminus T|$ be the number of wrong regressors
selected and $\hat k = |T \setminus \hat T|$ be the number of correct regressors missed. For any $\epsilon > 0$ there is a
constant $K_\epsilon$ independent of $n$ such that with probability at least $1 - \epsilon$,

$$B_n \le \|\hat\beta - \beta_0\|_{2,n}^2 + \left[K_\epsilon \sigma \sqrt{\frac{\hat m \log p + (\hat m + s)\log \mu_{\hat m}}{n}} + 2c_s\right]\|\hat\beta - \beta_0\|_{2,n},$$



C„     <     l{T^f}[\\p^f^\\l, 


Mfn±l^g^^2c, 
Three implications of Theorem 1 are worth noting. Firstly and most importantly, note that
the bounds on the prediction norm stated in Theorem 1 and Lemma 5 apply to any generic
post-model selection estimator, provided we can bound both the rate of convergence $\|\hat\beta - \beta_0\|_{2,n}$ of the
first-step estimator and $\hat m$, the number of wrong regressors selected by the first-step estimator.

Secondly, note that if the selected model contains the true model, $T \subseteq \hat T$, then we have
$(B_n)_+ \wedge (C_n)_+ = C_n = 0$, and $B_n$ does not affect the rate at all, and the performance of the
second-step estimator is dictated by the sparsity $\hat m$ of the first-step estimator, which controls
the magnitude of the empirical errors. Otherwise, if the selected model fails to contain the
true model, that is, $T \not\subseteq \hat T$, the performance of the second-step estimator is determined by
both the sparsity $\hat m$ and the minimum between $B_n$ and $C_n$. Intuitively, $B_n$ measures the
in-sample goodness-of-fit (or loss-of-fit) induced by the first-step estimator relative to the "true"
parameter value $\beta_0$, and $C_n$ measures the in-sample loss-of-fit induced by truncating the "true"
parameter $\beta_0$ outside the selected model $\hat T$.

Finally,  note  that  rates  in  other  norms  of  interest  immediately  follow  from  the  following 
relations: 


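$$\sqrt{E_n[(x_i'\tilde\beta - f_i)^2]} \le \|\tilde\beta - \beta_0\|_{2,n} + c_s, \qquad \|\tilde\beta - \beta_0\| \le \|\tilde\beta - \beta_0\|_{2,n}/\kappa(\hat m), \qquad (3.14)$$

where $\hat m = |\hat T \setminus T|$.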

The  proof  of  Theorem  1  and  Lemma  5  relies  on  the  sparsity-based  control  of  the  empirical 
error  provided  by  the  following  lemma. 

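Lemma 6 (Sparsity-based control of noise). 1) For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent
of $n$ such that with probability at least $1 - \epsilon$,

$$\left|\hat Q(\beta_0 + \delta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2\right| \le K_\epsilon \sigma \sqrt{\frac{m \log p + (m + s)\log \mu_m}{n}}\,\|\delta\|_{2,n} + 2c_s\|\delta\|_{2,n},$$

uniformly for all $\delta \in \mathbb{R}^p$ such that $\|\delta_{T^c}\|_0 \le m$, and uniformly over $m \le n$,
where $c_s = 0$ in the parametric model. 2) Furthermore, with at least the same probability,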


|Q(,V)  -  Q(/?o)  -  l|/3ofcllLl  <  i^e^y  °^^'^^^^°^^'°ll/3ofJl2,n  +  2c,||/3o^.||2,„, 
uniformly  for  all  T  c  T  such  that  \T  \T\  =  k,  and  uniformly  over  k  <  s, 

where  Cs  =  0  in  the  parametric  model. 
The proof of the lemma in turn relies on the following maximal inequality, which we state as
a separate theorem since it may be of independent interest. The proof of the theorem involves
the use of Samorodnitsky-Talagrand's inequality.

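Theorem 2 (Maximal inequality for a collection of empirical processes). Let $\epsilon_i \sim N(0, \sigma^2)$ be
independent for $i = 1, \ldots, n$, and for $m = 1, \ldots, n$ define

$$e_n(m, \eta) := \sigma 2\sqrt{2}\left(\sqrt{\log\binom{p}{m}} + \sqrt{(m + s)\log(D\mu_m)} + \sqrt{(m + s)\log(1/\eta)}\right)$$

for any $\eta \in (0, 1)$ and some universal constant $D$. Then

$$\sup_{\|\delta_{T^c}\|_0 \le m,\ \|\delta\|_{2,n} > 0} \left|G_n\!\left(\frac{\epsilon_i x_i'\delta}{\|\delta\|_{2,n}}\right)\right| \le e_n(m, \eta), \quad \text{for all } m \le n,$$

with probability at least $1 - \eta e^{-s}/(1 - 1/e)$.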


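Proof. Step 0. Note that we can restrict the supremum over $\|\delta\| = 1$ since the function is
homogeneous of degree zero.

Step 1. For each non-negative integer $m \le n$, and each set $\tilde T \subset \{1, \ldots, p\}$, with $|\tilde T \setminus T| \le m$,
define the class of functions

$$\mathcal{G}_{\tilde T} = \{\epsilon_i x_i'\delta/\|\delta\|_{2,n} : \mathrm{support}(\delta) \subseteq \tilde T,\ \|\delta\| = 1\}. \qquad (3.15)$$

Also define

$$\mathcal{F}_m = \{\mathcal{G}_{\tilde T} : \tilde T \subset \{1, \ldots, p\} : |\tilde T \setminus T| \le m\}.$$

It follows that

$$P\left(\sup_{f \in \mathcal{F}_m}|G_n(f)| \ge e_n(m, \eta)\right) \le \binom{p}{m}\max_{|\tilde T \setminus T| \le m} P\left(\sup_{f \in \mathcal{G}_{\tilde T}}|G_n(f)| \ge e_n(m, \eta)\right). \qquad (3.16)$$

We apply Samorodnitsky-Talagrand's inequality (Proposition A.2.7 in van der Vaart and
Wellner [18]) to bound the right hand side of (3.16). Let

$$\rho(f, g) := \sqrt{E[G_n(f) - G_n(g)]^2} = \sqrt{E E_n[(f - g)^2]}$$

for $f, g \in \mathcal{G}_{\tilde T}$; by Step 2 below, the covering number of $\mathcal{G}_{\tilde T}$ with respect to $\rho$ obeys

$$N(\varepsilon, \mathcal{G}_{\tilde T}, \rho) \le (6\sigma\mu_m/\varepsilon)^{m+s}, \quad \text{for each } 0 < \varepsilon \le \sigma, \qquad (3.17)$$

and $\sigma^2(\mathcal{G}_{\tilde T}) := \max_{f \in \mathcal{G}_{\tilde T}} E[G_n(f)]^2 = \sigma^2$. Then, by Samorodnitsky-Talagrand's inequality

$$P\left(\sup_{f \in \mathcal{G}_{\tilde T}}|G_n(f)| \ge e_n(m, \eta)\right) \le \left(\frac{D\sigma\mu_m e_n(m, \eta)}{\sqrt{m + s}\,\sigma^2}\right)^{m+s}\bar\Phi(e_n(m, \eta)/\sigma)$$

for some universal constant $D > 1$. For $e_n(m, \eta)$ defined in the statement of the theorem, it
follows that

$$P\left(\sup_{f \in \mathcal{F}_m}|G_n(f)| \ge e_n(m, \eta)\right) \le \eta e^{-m-s}.$$

Then,

$$P\left(\sup_{f \in \mathcal{F}_m}|G_n(f)| > e_n(m, \eta),\ \exists m \le n\right) \le \sum_{m=0}^{n} P\left(\sup_{f \in \mathcal{F}_m}|G_n(f)| > e_n(m, \eta)\right) \le \sum_{m=0}^{n}\eta e^{-m-s} \le \eta e^{-s}/(1 - 1/e),$$

which proves the claim of the theorem.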

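Step 2. This step establishes (3.17). For $t \in \mathbb{R}^p$ and $\tilde t \in \mathbb{R}^p$, consider any two functions

$$\epsilon_i \frac{x_i't}{\|t\|_{2,n}} \quad \text{and} \quad \epsilon_i \frac{x_i'\tilde t}{\|\tilde t\|_{2,n}} \quad \text{in } \mathcal{G}_{\tilde T}, \text{ for a given } \tilde T \subset \{1,\ldots,p\} : |\tilde T \setminus T| \le m.$$

We have that

$$\sqrt{E\,\mathbb{E}_n\left[\epsilon_i^2\left(\frac{x_i't}{\|t\|_{2,n}} - \frac{x_i'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right]} \le \sqrt{E\,\mathbb{E}_n\left[\epsilon_i^2\frac{(x_i'(t - \tilde t))^2}{\|t\|_{2,n}^2}\right]} + \sqrt{E\,\mathbb{E}_n\left[\epsilon_i^2\left(\frac{x_i'\tilde t}{\|t\|_{2,n}} - \frac{x_i'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right]}.$$

By definition of $\mathcal{G}_{\tilde T}$ in (3.15), $\operatorname{support}(t) \subseteq \tilde T$ and $\operatorname{support}(\tilde t) \subseteq \tilde T$, so that $\operatorname{support}(t - \tilde t) \subseteq \tilde T$,
$|\tilde T \setminus T| \le m$, and $\|t\| = 1$ by (3.15). Hence by definitions RSE.1($m$) and RSE.2($m$),

$$E\,\mathbb{E}_n\left[\epsilon_i^2\frac{(x_i'(t - \tilde t))^2}{\|t\|_{2,n}^2}\right] \le \sigma^2\phi(m)\|t - \tilde t\|^2/\kappa(m)^2, \quad \text{and}$$

$$E\,\mathbb{E}_n\left[\epsilon_i^2\left(\frac{x_i'\tilde t}{\|t\|_{2,n}} - \frac{x_i'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right] = E\,\mathbb{E}_n\left[\epsilon_i^2\frac{(x_i'\tilde t)^2}{\|\tilde t\|_{2,n}^2}\left(\frac{\|\tilde t\|_{2,n} - \|t\|_{2,n}}{\|t\|_{2,n}}\right)^2\right]$$

$$= \sigma^2\left(\frac{\|\tilde t\|_{2,n} - \|t\|_{2,n}}{\|t\|_{2,n}}\right)^2 \le \sigma^2\|\tilde t - t\|_{2,n}^2/\|t\|_{2,n}^2 \le \sigma^2\phi(m)\|\tilde t - t\|^2/\kappa(m)^2.$$

Thus

$$\sqrt{E\,\mathbb{E}_n\left[\epsilon_i^2\left(\frac{x_i't}{\|t\|_{2,n}} - \frac{x_i'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right]} \le 2\sigma\|t - \tilde t\|\sqrt{\phi(m)}/\kappa(m) = 2\sigma\mu_m\|t - \tilde t\|.$$

Then the bound (3.17) follows from the bound in [18] page 94 with $R = 2\sigma\mu_m$ for any $\varepsilon \le \sigma$.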
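Proof of Theorem 1. Let $\tilde\delta := \tilde\beta - \beta_0$. By definition of the second-step estimator, it follows
that $\hat Q(\tilde\beta) \le \hat Q(\hat\beta)$ and $\hat Q(\tilde\beta) \le \hat Q(\beta_{0\hat T})$. Thus,

$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \big(\hat Q(\hat\beta) - \hat Q(\beta_0)\big) \wedge \big(\hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)\big) \le B_n \wedge C_n.$$

By Lemma 6 part (1), for any $\epsilon > 0$ there exists a constant $K_\epsilon$ such that with probability at
least $1 - \epsilon$:

$$\big|\hat Q(\tilde\beta) - \hat Q(\beta_0) - \|\tilde\delta\|_{2,n}^2\big| \le A_{\epsilon,n}\|\tilde\delta\|_{2,n} + 2c_s\|\tilde\delta\|_{2,n}$$

where

$$A_{\epsilon,n} := K_\epsilon\sigma\sqrt{\big(\hat m\log p + (\hat m + s)\log\mu_{\hat m}\big)/n}.$$

Combining these relations we obtain the inequality

$$\|\tilde\delta\|_{2,n}^2 - A_{\epsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le B_n \wedge C_n,$$

solving which we obtain the stated result:

$$\|\tilde\delta\|_{2,n} \le A_{\epsilon,n} + 2c_s + \sqrt{(B_n)_+ \wedge (C_n)_+}.$$

□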
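The last step of the proof solves a quadratic inequality of the form $x^2 - ax \le b$ with $a = A_{\epsilon,n} + 2c_s \ge 0$ and $b = B_n \wedge C_n$. The following sketch checks numerically, on a small grid of illustrative values, that the exact root of the quadratic never exceeds the simpler bound $a + \sqrt{b_+}$ used in the statement. The function names and the grid are assumptions made for this illustration only.

```python
import math

def exact_root(a, b):
    """Largest x satisfying x**2 - a*x <= b (requires a*a + 4*b >= 0)."""
    disc = a * a + 4.0 * b
    if disc < 0:
        raise ValueError("no real x satisfies x**2 - a*x <= b")
    return (a + math.sqrt(disc)) / 2.0

def stated_bound(a, b):
    """The simpler bound a + sqrt(max(b, 0)) that appears in the theorem."""
    return a + math.sqrt(max(b, 0.0))

# Illustrative grid: a >= 0 plays the role of A + 2*c_s, b the role of B_n ^ C_n.
gaps = []
for a in (0.0, 0.5, 1.0, 2.0):
    for b in (-0.05, 0.0, 0.3, 1.0, 4.0):
        if a * a + 4.0 * b >= 0:
            gaps.append(stated_bound(a, b) - exact_root(a, b))

print(min(gaps) >= -1e-12)
```

The gap is nonnegative because $\sqrt{a^2 + 4b} \le a + 2\sqrt{b}$ for $a, b \ge 0$, while for $b < 0$ the exact root is at most $a$.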


4.  A  Generic  Result  on  Post-Trimmed  Estimators 

In this section we investigate post-trimmed estimators that arise from applying unpenalized
least squares in the second step to the models selected by trimmed estimators in the first step.
Formally, given a first-step estimator $\hat\beta$, we define its trimmed support at level $t \ge 0$ as

$$\hat T(t) := \{j \in \{1,\ldots,p\} : |\hat\beta_j| > t\}.$$

We then define the post-trimmed estimator as

$$\tilde\beta^t = \arg\min_{\beta:\ \operatorname{support}(\beta) \subseteq \hat T(t)} \hat Q(\beta). \tag{4.18}$$

The traditional trimming scheme sets the trimming threshold $t$ at or above the uniform
estimation error $\max_{1 \le j \le p}|\hat\beta_j - \beta_{0j}|$, so as to trim all coefficient estimates smaller than this error. As
discussed in Section 2.3, this method is particularly appealing in parametric models in which
the non-zero components are well separated from zero, where it acts as a very effective model
selection device. Unfortunately, this scheme may perform poorly in parametric models with
true coefficients not well separated from zero and in nonparametric models. Indeed, even in
parametric models with many small but non-zero true coefficients, trimming the estimates
too  aggressively  may  result  in  large  goodness-of-fit  losses,  and  consequently  in  slow  rates  of 
convergence  and  even  inconsistency  for  the  second-step  estimators.  This  issue  directly  motivates 
our  new  goodness-of-fit  based  trimming  method,  which  trims  small  coefficient  estimates  as  much 
as  possible  subject  to  maintaining  a  certain  goodness-of-fit  level.  Unlike  traditional  trimming, 
our  new  method  is  completely  data-driven,  which  makes  it  appealing  for  practice.  Moreover, 
our  method  is  at  least  as  good  as  LASSO  or  post-LASSO  theoretically,  but  performs  better 
than  both  of  these  methods  in  a  wide  range  of  experiments,  practically.  In  the  remainder  of  the 
section  we  present  generic  performance  bounds  for  both  the  new  method  and  the  traditional 
trimming  method. 
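As a concrete illustration of the trimmed support and the post-trimmed estimator (4.18), the following minimal sketch refits by unpenalized least squares on the coordinates whose first-step estimates exceed a threshold. The function names, the simulated data, and the stand-in first-step estimate are assumptions made for this example; they are not taken from the paper's experiments.

```python
import numpy as np

def trimmed_support(beta_hat, t):
    """Indices j with |beta_hat[j]| > t (trimmed support at level t)."""
    return np.flatnonzero(np.abs(beta_hat) > t)

def q_hat(X, y, beta):
    """Least squares criterion: sample mean of squared residuals."""
    return float(np.mean((y - X @ beta) ** 2))

def post_trimmed(X, y, beta_hat, t):
    """Unpenalized least squares restricted to the trimmed support."""
    support = trimmed_support(beta_hat, t)
    beta = np.zeros(X.shape[1])
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta

# Hypothetical data: two relevant regressors out of eight.
rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
beta0 = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta0 + 0.1 * rng.standard_normal(n)

# Stand-in first-step estimate: shrunken signal plus two small spurious entries.
beta_hat = np.array([1.6, -1.1, 0.05, 0.0, -0.02, 0.0, 0.0, 0.0])
beta_tilde = post_trimmed(X, y, beta_hat, t=0.1)
print(trimmed_support(beta_hat, 0.1), q_hat(X, y, beta_tilde))
```

In this example the threshold removes the two spurious coordinates, and the refit on the remaining support fits at least as well as the true coefficient vector, since the true vector is itself supported on the selected coordinates.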

4.1. Goodness-of-Fit Trimming. Here we propose a trimming method that selects the trimming
level $t$ based on the goodness-of-fit of the post-trimmed estimator. Let $\gamma \le 0$ denote the
maximal allowed loss (gain) in goodness-of-fit (goof) relative to the first-step estimator. We
define the goof-trimming threshold $t_\gamma$ as the solution to

$$t_\gamma := \max\{t : \hat Q(\tilde\beta^t) - \hat Q(\hat\beta) \le \gamma\}. \tag{4.19}$$

Then we define the selected model and the post-goof-trimmed estimators as:

$$\tilde T := \hat T(t_\gamma) \quad \text{and} \quad \tilde\beta := \tilde\beta^{t_\gamma}. \tag{4.20}$$

Our construction (4.19) and (4.20) selects the most aggressive trimming threshold subject to
maintaining a certain level of goodness-of-fit as measured by the least squares criterion function.
Note that we can compute the data-driven trimming threshold (4.19) very effectively using a
binary search procedure described below.

Theorem 3 (Performance of a generic post-goof-trimmed estimator). In either the parametric
or the nonparametric model, let $\hat\beta$ be any first-step estimator, $\tilde m := |\tilde T \setminus T|$, and $B_n := \hat Q(\hat\beta) - \hat Q(\beta_0)$
and $C_n := \hat Q(\beta_{0\tilde T}) - \hat Q(\beta_0)$. For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent of $n$ such
that with probability at least $1 - \epsilon$

$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\epsilon\sigma\sqrt{\frac{\tilde m\log p + (\tilde m + s)\log\mu_{\tilde m}}{n}} + 2c_s + \sqrt{(\gamma + B_n)_+ \wedge (C_n)_+}, \tag{4.21}$$

where $c_s = 0$ in the parametric model. Furthermore, $B_n$ and $C_n$ obey bounds (3.13) stated
earlier, with $\hat T = \tilde T$.

Note that the bounds on the prediction norm stated in Theorem 3 and equation (3.13)
in Lemma 5 apply to any generic post-goof-trimmed estimator, provided we can bound both
the rate of convergence $\|\hat\beta - \beta_0\|_{2,n}$ of the first-step estimator and $\tilde m$, the number of wrong
regressors selected by the trimmed first-step estimator. For the purpose of obtaining rates,
we can often use the bound $\tilde m \le \hat m$, where $\hat m$ is the number of wrong regressors selected by
the first-step estimator, provided the bounds on $\hat m$ are tight, as, for example, in the case of
LASSO. Of course, $\tilde m$ is potentially much smaller than $\hat m$, resulting in a smaller variance for
the post-goof-trimmed estimator. For instance, in the case of LASSO, we can even have $\tilde m = 0$,
if the conditions of Lemma 3 on perfect model selection in the parametric model hold with the
threshold $t = t_\gamma$.

Also, note that if the selected model contains the true model, that is $T \subseteq \tilde T$, then we have
$(B_n)_+ \wedge (C_n)_+ = C_n = 0$, and these terms drop out of the rate. Lemma 3 provides sufficient
conditions for this to hold for the given threshold $t = t_\gamma$. Otherwise, if the selected model fails
to contain the true model, that is, $T \not\subseteq \tilde T$, the performance of the second-step estimator is
determined by both $\tilde m$ and $B_n \wedge C_n$.

Comment 4.1 (Recommended choice of $\gamma$). A nice feature of the theorem above is that it
allows for a wide range of choices of $\gamma$. The simplest choice with good theoretical guarantees is

$$\gamma = 0,$$

which requires there to be no loss of fit relative to the first-step estimator. We can also use
any (feasible) $\gamma < 0$, since a negative $\gamma$ actually requires the second-step estimator to gain fit
relative to the first-step estimator. This makes sense, since the first-step estimator can suffer
from a large regularization bias. Consequently, our recommended data-driven choice is

$$\gamma = \frac{\hat Q(\tilde\beta^0) - \hat Q(\hat\beta)}{2} < 0, \tag{4.22}$$

where $\tilde\beta^0$ is the post-trimmed estimator for $t = 0$. The theoretical guarantees of this choice are
comparable to those of $\gamma = 0$, but this proposal led to the best performance in our computational
experiments. Note that if we could set $\gamma + B_n = 0$, which is not practical and not always feasible,
we would eliminate the second term in the rate bound (4.21). Since $B_n \approx \hat Q(\hat\beta) - \sigma^2 > 0$ if $\hat\beta$
has a substantial regularization bias, we would then have $\gamma < 0$. Although this choice is not available
in general, it provides a simple rationale for choosing $\gamma < 0$ as we did in (4.22).

Comment 4.2 (Efficient computation of $t_\gamma$). For any $\gamma$, we can compute the value $t_\gamma$ by a
binary search over $t$. Since there are at most $|\hat T|$ possible relevant values of $t$, we can compute
$t_\gamma$ exactly by running at most $\lceil\log_2|\hat T|\rceil$ unpenalized least squares problems.
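The binary search works because the models $\hat T(t)$ are nested and shrink as $t$ grows, so the refitted criterion $\hat Q(\tilde\beta^t)$ is non-decreasing in $t$. The following is a minimal sketch of such a search over the candidate thresholds (zero and the distinct magnitudes of the first-step coefficients). The function names and the toy design are assumptions for illustration, and the sketch returns the largest feasible candidate rather than claiming to reproduce the authors' implementation.

```python
import numpy as np

def q_hat(X, y, beta):
    """Least squares criterion: sample mean of squared residuals."""
    return float(np.mean((y - X @ beta) ** 2))

def refit(X, y, support):
    """Unpenalized least squares on the given support, zero elsewhere."""
    beta = np.zeros(X.shape[1])
    if len(support) > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta

def goof_trim(X, y, beta_hat, gamma=0.0):
    """Return (t_gamma, beta_tilde), or (None, None) if gamma is infeasible."""
    mags = np.abs(beta_hat)
    # Candidate thresholds in ascending order: 0 and the distinct nonzero magnitudes.
    candidates = np.concatenate(([0.0], np.unique(mags[mags > 0])))
    target = q_hat(X, y, beta_hat) + gamma
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        beta = refit(X, y, np.flatnonzero(mags > candidates[mid]))
        if q_hat(X, y, beta) <= target:
            best = (float(candidates[mid]), beta)   # feasible: try trimming more
            lo = mid + 1
        else:
            hi = mid - 1                            # infeasible: trim less
    return best if best is not None else (None, None)

# Toy orthogonal design with normalized columns (an assumption for this example).
X = 2.0 * np.eye(4)
y = np.array([6.0, 4.0, 0.2, 0.0])
beta_hat = np.array([2.5, 1.5, 0.05, 0.0])   # stand-in shrunken first-step estimate
t_gamma, beta_tilde = goof_trim(X, y, beta_hat, gamma=0.0)
print(t_gamma, beta_tilde)
```

In this toy example the search trims the small third coefficient but stops before removing the second one, because dropping it would push the criterion above that of the first-step estimate.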
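Proof of Theorem 3. Let $\tilde\delta := \tilde\beta - \beta_0$. By definition $\hat Q(\tilde\beta) \le \hat Q(\hat\beta) + \gamma$, so that

$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \gamma + \hat Q(\hat\beta) - \hat Q(\beta_0) = \gamma + B_n.$$

On the other hand, since $\tilde\beta$ is a minimizer of $\hat Q$ over the support $\tilde T$, $\hat Q(\tilde\beta) \le \hat Q(\beta_{0\tilde T})$ so that

$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \hat Q(\beta_{0\tilde T}) - \hat Q(\beta_0) = C_n.$$

By Lemma 6 part (1), for any $\epsilon > 0$, there is a constant $K_\epsilon$ such that with probability at least
$1 - \epsilon$

$$\|\tilde\delta\|_{2,n}^2 - A_{\epsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le \hat Q(\tilde\beta) - \hat Q(\beta_0),$$

where

$$A_{\epsilon,n} := K_\epsilon\sigma\sqrt{\big(\tilde m\log p + (\tilde m + s)\log\mu_{\tilde m}\big)/n}.$$

Combining the inequalities gives

$$\|\tilde\delta\|_{2,n}^2 - A_{\epsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le (\gamma + B_n) \wedge C_n.$$

Solving this inequality for $\|\tilde\delta\|_{2,n}$ gives the stated result. □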

4.2. Traditional Trimming. Next we consider the traditional trimming scheme, which is
based on the magnitude of the estimated coefficients. Given the first-step estimator $\hat\beta$, define
the trimmed first-step estimator $\hat\beta_t$ by setting $\hat\beta_{tj} = \hat\beta_j 1\{|\hat\beta_j| > t\}$ for $j = 1,\ldots,p$. Finally define
the selected model and the post-trimmed estimator as

$$\tilde T = \hat T(t) \quad \text{and} \quad \tilde\beta = \tilde\beta^t. \tag{4.23}$$

Let $\tilde m := |\tilde T \setminus T|$ denote the number of components selected outside the support $T$, $m_t := |\hat T \setminus \tilde T|$ the
number of trimmed components of the first-step estimator, and $\gamma_t := \|\hat\beta_t - \hat\beta\|_{2,n}$ the prediction
norm distance from the first-step estimator $\hat\beta$ to the trimmed estimator $\hat\beta_t$.

Theorem 4 (Performance of a generic post-traditional-trimmed estimator). In either the parametric
or the nonparametric model, let $\hat\beta$ be any first-step estimator, and let $B_n := \hat Q(\hat\beta) - \hat Q(\beta_0)$
and $C_n := \hat Q(\beta_{0\tilde T}) - \hat Q(\beta_0)$. For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent of $n$ such that
with probability at least $1 - \epsilon$

$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\epsilon\sigma\sqrt{\big(\tilde m\log p + (\tilde m + s)\log\mu_{\tilde m}\big)/n} + 2c_s + \sqrt{\gamma_t\big(K_\epsilon\sigma G_t + 2c_s + \gamma_t + 2\|\hat\beta - \beta_0\|_{2,n}\big) + (B_n)_+} \wedge \sqrt{(C_n)_+},$$

where $G_t = \sqrt{m_t\log(p\mu_{m_t})}/\sqrt{n}$, $\gamma_t \le t\sqrt{\phi(m_t)m_t}$, and $c_s = 0$ in the parametric model. Furthermore,
$B_n$ and $C_n$ obey bounds (3.13) stated earlier, with $\hat T = \tilde T$.
Note that the bounds on the prediction norm stated in Theorem 4 and equation (3.13) in
Lemma 5 apply to any generic post-traditional-trimmed estimator. All components of the
bounds are easily controlled, just as in the case of Theorem 3. A major determinant of the
performance is $\gamma_t$, which measures loss-of-fit due to trimming. If the trimming threshold is too
aggressive, for example, as suggested in the model selection Lemma 3 (2), then $\gamma_t$ can be very
large. Indeed, in the parametric models with true coefficients not well separated from zero and
in the nonparametric models, aggressive trimming can result in large goodness-of-fit losses $\gamma_t$,
and consequently in very slow rates of convergence and even inconsistency for the second-step
estimators. We further discuss this issue in the next section in the context of LASSO. There
are of course exceptional cases where good model selection is possible. One example is the
parametric model with well-separated coefficients, where $T \subseteq \tilde T$ wp $\to 1$ so that $C_n = 0$ wp
$\to 1$, which eliminates dependence of performance bounds on $\gamma_t$ completely.

Comment 4.3 (Traditional trimming based on goodness-of-fit). We can fix some drawbacks
of traditional trimming by selecting the threshold $t$ to imply at most a specific loss of fit $\gamma_t$. For
a given $\gamma_t \ge 0$, we can set $t = \max\{t : \|\hat\beta_t - \hat\beta\|_{2,n} \le \gamma_t\}$. This choice uses maximal trimming
subject to maintaining a certain goodness-of-fit level, as measured by the prediction norm. Our
theorem above formally covers this choice. However, it is not easy to specify practical,
data-driven $\gamma_t$. Our main proposal described in the previous subsection resolves just such difficulties.
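The threshold rule in this comment can be sketched as follows: compute the loss of fit $\gamma_t = \|\hat\beta_t - \hat\beta\|_{2,n}$ at each candidate threshold and keep the largest threshold whose loss stays within a given budget. The function names, the budget value, and the toy design below are assumptions for illustration. Because $\gamma_t$ need not be monotone in $t$ when regressors are correlated, the sketch scans all candidates instead of using a binary search.

```python
import numpy as np

def hard_threshold(beta_hat, t):
    """Trimmed first-step estimator: keep entries with |beta_hat[j]| > t."""
    return np.where(np.abs(beta_hat) > t, beta_hat, 0.0)

def prediction_norm(X, delta):
    """Empirical prediction norm: sqrt of the sample mean of (x_i' delta)^2."""
    return float(np.sqrt(np.mean((X @ delta) ** 2)))

def fit_loss(X, beta_hat, t):
    """gamma_t: prediction-norm distance from beta_hat to its trimmed version."""
    return prediction_norm(X, hard_threshold(beta_hat, t) - beta_hat)

def largest_threshold_within(X, beta_hat, budget):
    """Largest candidate t with gamma_t <= budget (budget >= 0 assumed)."""
    mags = np.abs(beta_hat)
    candidates = np.concatenate(([0.0], np.unique(mags[mags > 0])))
    feasible = [float(t) for t in candidates if fit_loss(X, beta_hat, t) <= budget]
    return max(feasible)   # t = 0 is always feasible, so the list is non-empty

# Toy orthogonal design with normalized columns (an assumption for this example).
X = 2.0 * np.eye(4)
beta_hat = np.array([2.5, 1.5, 0.05, 0.0])
t_star = largest_threshold_within(X, beta_hat, budget=0.1)
print(t_star, fit_loss(X, beta_hat, t_star))
```

In this design each column has unit empirical second moment, so trimming the single coefficient of size 0.05 costs exactly 0.05 in prediction norm, which matches the bound $\gamma_t \le t\sqrt{\phi(m_t)m_t}$ with equality.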

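Proof of Theorem 4. Let $\delta := \hat\beta - \beta_0$, $\delta^t := \hat\beta_t - \beta_0$, and $\tilde\delta := \tilde\beta - \beta_0$. By definition of the
estimator, $\hat Q(\tilde\beta) \le \hat Q(\hat\beta_t) \wedge \hat Q(\beta_{0\tilde T})$, so that

$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \big(\hat Q(\hat\beta_t) - \hat Q(\beta_0)\big) \wedge \big(\hat Q(\beta_{0\tilde T}) - \hat Q(\beta_0)\big) \le \big(\hat Q(\hat\beta_t) - \hat Q(\hat\beta) + B_n\big) \wedge C_n$$

since $B_n = \hat Q(\hat\beta) - \hat Q(\beta_0)$.

By Lemma 6 (1), for any $\epsilon > 0$ there is a constant $K_{\epsilon,1}$ such that with probability at least
$1 - \epsilon/2$

$$\|\tilde\delta\|_{2,n}^2 - A_{\epsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le \hat Q(\tilde\beta) - \hat Q(\beta_0),$$

where

$$A_{\epsilon,n} := K_{\epsilon,1}\sigma\sqrt{\big(\tilde m\log p + (\tilde m + s)\log\mu_{\tilde m}\big)/n}.$$

On the other hand, we have

$$\hat Q(\hat\beta_t) - \hat Q(\hat\beta) = \hat Q(\hat\beta_t) - \hat Q(\beta_0) + \hat Q(\beta_0) - \hat Q(\hat\beta)$$

$$= -2\mathbb{E}_n[\epsilon_i x_i'(\hat\beta_t - \hat\beta)] - 2\mathbb{E}_n[r_i x_i'(\hat\beta_t - \hat\beta)] + \|\delta^t\|_{2,n}^2 - \|\delta\|_{2,n}^2.$$

To bound the terms above, note first that by Theorem 2, there is a constant $K_{\epsilon,2}$ such that
with probability at least $1 - \epsilon/2$

$$|2\mathbb{E}_n[\epsilon_i x_i'(\hat\beta_t - \hat\beta)]| \le \sigma K_{\epsilon,2}G_t\|\hat\beta_t - \hat\beta\|_{2,n},$$

and, second, by Cauchy-Schwarz $|2\mathbb{E}_n[r_i x_i'(\hat\beta_t - \hat\beta)]| \le 2c_s\|\hat\beta_t - \hat\beta\|_{2,n}$. Moreover,

$$\|\delta^t\|_{2,n}^2 - \|\delta\|_{2,n}^2 = \big(\|\delta^t\|_{2,n} - \|\delta\|_{2,n}\big)\big(\|\delta^t\|_{2,n} + \|\delta\|_{2,n}\big) \le \|\hat\beta_t - \hat\beta\|_{2,n}\big(\|\hat\beta_t - \hat\beta\|_{2,n} + 2\|\delta\|_{2,n}\big).$$

Combining these inequalities and using that $\gamma_t = \|\hat\beta_t - \hat\beta\|_{2,n}$, we obtain with probability at
least $1 - \epsilon$

$$\|\tilde\delta\|_{2,n}^2 - A_{\epsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le \big(\sigma K_{\epsilon,2}G_t\gamma_t + 2c_s\gamma_t + \gamma_t(\gamma_t + 2\|\delta\|_{2,n}) + B_n\big) \wedge C_n.$$

Thus, solving the resulting quadratic inequality for $\|\tilde\delta\|_{2,n}$, we obtain

$$\|\tilde\delta\|_{2,n} \le A_{\epsilon,n} + 2c_s + \sqrt{\sigma K_{\epsilon,2}G_t\gamma_t + 2c_s\gamma_t + \gamma_t(\gamma_t + 2\|\delta\|_{2,n}) + (B_n)_+} \wedge \sqrt{(C_n)_+},$$

which gives the stated result by taking $K_\epsilon = K_{\epsilon,1} \vee K_{\epsilon,2}$. Also, note that $\gamma_t \le t\sqrt{\phi(m_t)m_t}$
follows by the Cauchy-Schwarz inequality and the definition of $\phi(m_t)$. □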

5.  Post  Model  Selection  Results  for  LASSO 

In  this  section  we  specialize  our  results  on  post-penalized  estimators  to  the  case  of  LASSO 
being  the  first-step  estimator.  The  previous  generic  results  allow  us  to  use  sparsity  bounds  and 
rate  of  convergence  of  LASSO  to  derive  the  rate  of  convergence  of  post-penalized  estimators 
in  the  parametric  and  nonparametric  models.  We  also  derive  new  sharp  sparsity  bounds  for 
LASSO,  which  may  be  of  independent  interest. 

5.1.  A  new,  oracle  sparsity  bound  for  LASSO.  We  begin  with  a  preliminary  sparsity 
bound  for  LASSO. 

Lemma 7 (Empirical pre-sparsity for LASSO). In either the parametric model or the nonparametric
model, let $\hat m = |\hat T \setminus T|$ and $\lambda \ge c \cdot n\|S\|_\infty$. We have

$$\sqrt{\hat m} \le \sqrt{s}\sqrt{\phi(\hat m)}\,2\bar c/\kappa_1 + 3(\bar c + 1)\sqrt{\phi(\hat m)}\,nc_s/\lambda,$$

where $c_s = 0$ in the parametric model.
The lemma above states that LASSO achieves the oracle sparsity up to a factor of $\phi(\hat m)$.
The lemma above immediately yields the simple upper bound on the sparsity of the form

$$\hat m \lesssim_P s\phi(n), \tag{5.24}$$

as obtained for example in [2] and [13]. Unfortunately, this bound is sharp only when $\phi(n)$ is
bounded. When $\phi(n)$ diverges, for example when $\phi(n) \gtrsim_P \sqrt{\log p}$ in the Gaussian design with
$p \ge 2n$, the bound is not sharp. However, for this case we can construct a sharp sparsity bound
by combining the preceding pre-sparsity result with the following sub-linearity property of the
restricted sparse eigenvalues.

Lemma 8 (Sub-linearity of restricted sparse eigenvalues). For any integer $k \ge 0$ and constant
$\ell \ge 1$ we have $\phi(\lceil\ell k\rceil) \le \lceil\ell\rceil\phi(k)$.
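The sub-linearity property can be checked numerically on a small design by brute force. The sketch below assumes that $\phi(m)$ is the largest value of $\|\delta\|_{2,n}^2/\|\delta\|^2$ over nonzero $\delta$ with at most $m$ nonzero components outside $T$, which equals the largest eigenvalue of the empirical Gram matrix restricted to supports $T \cup A$ with $|A| \le m$. The function name, the random design, and the choice of $T$ are illustrative assumptions.

```python
import itertools
import math
import numpy as np

def restricted_sparse_eig_max(X, T, m):
    """Brute-force phi(m): max eigenvalue of the Gram matrix over supports T ∪ A, |A| <= m."""
    n, p = X.shape
    gram = X.T @ X / n
    outside = [j for j in range(p) if j not in T]
    m = min(m, len(outside))
    best = 0.0
    # Scanning sets A of size exactly m suffices: by eigenvalue interlacing,
    # enlarging the support cannot lower the largest eigenvalue.
    for extra in itertools.combinations(outside, m):
        idx = sorted(set(T) | set(extra))
        if idx:
            sub = gram[np.ix_(idx, idx)]
            best = max(best, float(np.linalg.eigvalsh(sub)[-1]))
    return best

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 7))   # hypothetical design, n = 30, p = 7
T = [0, 1]                         # hypothetical true support

checks = []
for k in (1, 2):
    for ell in (1.5, 2.0, 2.5):
        lhs = restricted_sparse_eig_max(X, T, math.ceil(ell * k))
        rhs = math.ceil(ell) * restricted_sparse_eig_max(X, T, k)
        checks.append(lhs <= rhs + 1e-9)

print(all(checks))
```

The enumeration is exponential in $m$ and is meant only as a sanity check on toy problems, not as a way to compute these eigenvalues at scale.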

A version of this lemma for unrestricted eigenvalues has been previously proven in [1]. The
combination of the preceding two lemmas gives the following sparsity theorem. Recall that we
assume $c_s \le \sigma\sqrt{s/n}$ and for $\alpha \le 1/4$ we have $\Lambda(1 - \alpha|X) \ge \sigma\sqrt{n}$.

Theorem 5 (Sparsity bound for LASSO under data-driven penalty). In either the parametric
model or the nonparametric model, consider the LASSO estimator with $\lambda \ge c\Lambda(1 - \alpha|X)$,
$\alpha \le 1/4$, $c_s \le \sigma\sqrt{s/n}$, and let $\hat m := |\hat T \setminus T|$. Consider the set
$\mathcal{M} = \{m \in \mathbb{N} : m > s\phi(m \wedge n) \cdot 2(2\bar c/\kappa_1 + 3(\bar c - 1))^2\}$. With probability at least $1 - \alpha$

$$\hat m \le s \cdot \min_{m \in \mathcal{M}} \phi(m \wedge n)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2.$$

The main implication of Theorem 5 is that if $\min_{m \in \mathcal{M}}\phi(m \wedge n) \lesssim_P 1$, which we show below
to be valid in Lemmas 9 and 10 for important designs, then with probability at least $1 - \alpha$

$$\hat m \lesssim_P s. \tag{5.25}$$

Consequently, for these designs, LASSO's sparsity is of the same order as the oracle sparsity,
namely $\hat s := |\hat T| \le s + \hat m \lesssim_P s$ with high probability. The reason for this is that $\min_{m \in \mathcal{M}}\phi(m) \ll \phi(n)$
for these designs, which allows us to sharpen the previous sparsity bound (5.24) considered
in [2] and [13]. Also, our new bound is comparable to the bounds in [20] in terms of order of
sharpness, but it requires a smaller penalty level $\lambda$ which also does not depend on the unknown
sparse eigenvalues as in [20].

Next we show that $\min_{m \in \mathcal{M}}\phi(m \wedge n) \lesssim_P 1$ for two very common designs of interest, so that
the bound (5.25) holds as a consequence. As a side contribution, we also show that for these
designs all the restricted sparse eigenvalues and restricted eigenvalues defined earlier behave
nicely. We state these results in asymptotic form for the sake of exposition, although we can
convert them to finite sample form using the results in [20] and Lemma 7.

The following lemma deals with a Gaussian design; it uses the standard concept of (unrestricted)
sparse eigenvalues (see, e.g. [2]) to state a primitive condition on the population design
matrix.

Lemma 9 (Gaussian design). Suppose $\tilde x_i$, $i = 1,\ldots,n$, are i.i.d. zero-mean Gaussian random
vectors, such that the population design matrix $E[\tilde x_i\tilde x_i']$ has ones on the diagonal, and its $s\log n$-sparse
eigenvalues are bounded from above by $\varphi < \infty$ and bounded from below by $\kappa^2 > 0$. Define
$x_i$ as a normalized form of $\tilde x_i$, namely $x_{ij} = \tilde x_{ij}/\sqrt{\mathbb{E}_n[\tilde x_{ij}^2]}$. Then for any $m \le (s\log(n/e)) \wedge (n/[16\log p])$,
with probability at least $1 - 2\exp(-n/16)$,

$$\phi(m) \le 8\varphi, \quad \kappa(m)^2 \ge \kappa^2/72, \quad \text{and} \quad \mu_m \le 24\sqrt{\varphi}/\kappa.$$

Therefore, under the conditions of Theorem 5 and $n/(s\log p) \to \infty$, we have that as $n \to \infty$

$$\hat m \le s \cdot (8\varphi)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2$$

with probability approaching at least $1 - \alpha$, where we can take $\kappa_1 \ge \kappa/24$.

The  following  lemma  deals  with  arbitrary  bounded  regressors. 

Lemma 10 (Bounded design). Suppose $\tilde x_i$, $i = 1,\ldots,n$, are i.i.d. bounded zero-mean random
vectors, with $\max_{1 \le i \le n,\, 1 \le j \le p}|\tilde x_{ij}| \le K_B$ for all $n$ and $p$. Assume that the population design
matrix $E[\tilde x_i\tilde x_i']$ has ones on the diagonal, and its $s\log n$-sparse eigenvalues are bounded from
above by $\varphi < \infty$ and bounded from below by $\kappa^2 > 0$. Define $x_i$ as a normalized form of $\tilde x_i$,
namely $x_{ij} = \tilde x_{ij}/\sqrt{\mathbb{E}_n[\tilde x_{ij}^2]}$. Then there is a constant $\varepsilon > 0$ such that if $\sqrt{n}/K_B \to \infty$ and
$m \le (s\log(n/e)) \wedge ([\varepsilon/K_B]\sqrt{n/\log p})$, we have that as $n \to \infty$

$$\phi(m) \le 4\varphi, \quad \kappa(m)^2 \ge \kappa^2/4, \quad \text{and} \quad \mu_m \le 4\sqrt{\varphi}/\kappa,$$

with probability approaching 1. Therefore, under the conditions of Theorem 5 and provided
$\sqrt{n}/(K_B s\sqrt{\log p}) \to \infty$, we have that as $n \to \infty$,

$$\hat m \le s \cdot (4\varphi)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2$$

with probability approaching at least $1 - \alpha$, where we can take $\kappa_1 \ge \kappa/8$.
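Proof of Theorem 5. The choice of $\lambda$ implies that with probability at least $1 - \alpha$ we have
$\lambda \ge c \cdot n\|S\|_\infty$. In that event, by Lemma 7

$$\sqrt{\hat m} \le \sqrt{\phi(\hat m)} \cdot 2\bar c\sqrt{s}/\kappa_1 + 3(\bar c + 1)\sqrt{\phi(\hat m)} \cdot nc_s/\lambda,$$

which can be rewritten as

$$\hat m \le s \cdot \phi(\hat m)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c + 1)\frac{nc_s}{\lambda\sqrt{s}}\right)^2. \tag{5.26}$$

Note that $\hat m \le n$ by optimality conditions. Consider any $M \in \mathcal{M}$, and suppose $\hat m > M$.
Therefore by Lemma 8 on sublinearity of sparse eigenvalues

$$\hat m \le s \cdot \left\lceil\frac{\hat m}{M}\right\rceil\phi(M)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c + 1)\frac{nc_s}{\lambda\sqrt{s}}\right)^2.$$

Thus, since $\lceil k\rceil < 2k$ for any $k \ge 1$ we have

$$M < s \cdot 2\phi(M)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c + 1)\frac{nc_s}{\lambda\sqrt{s}}\right)^2,$$

which violates the condition on $M$ and $s$ since $c_s \le \sigma\sqrt{s/n}$, $\lambda \ge c\sigma\sqrt{n}$, and $(\bar c + 1)/c = \bar c - 1$.
Therefore, we must have $\hat m \le M$.

In turn, applying (5.26) once more with $\hat m \le (M \wedge n)$ we obtain

$$\hat m \le s \cdot \phi(M \wedge n)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c + 1)\frac{nc_s}{\lambda\sqrt{s}}\right)^2.$$

Further, using again that $c_s \le \sigma\sqrt{s/n}$ and $\lambda \ge c\sigma\sqrt{n}$ we have

$$\hat m \le s \cdot \phi(M \wedge n)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2$$

since $(\bar c + 1)/c = \bar c - 1$. The result follows by minimizing the bound over $M \in \mathcal{M}$. □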

5.2.  Performance  of  the  post-LASSO  Estimator.  Here  we  show  that  the  post-LASSO 
estimator  enjoys  good  theoretical  performance  despite  possibly  "poor"  selection  of  the  model 
by  LASSO. 
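To make the two-step procedure concrete, the following sketch computes a LASSO fit by plain coordinate descent for the objective $\hat Q(\beta) + (\lambda/n)\|\beta\|_1$ and then refits by unpenalized least squares on the selected support. The solver, the simulated data, and the penalty level are assumptions chosen for illustration; in particular, the penalty below is a simple stand-in and not the data-driven rule (2.3).

```python
import numpy as np

def q_hat(X, y, beta):
    """Least squares criterion: sample mean of squared residuals."""
    return float(np.mean((y - X @ beta) ** 2))

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for Q_hat(beta) + (lam / n) * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]      # remove coordinate j from the fit
            rho = X[:, j] @ resid
            # Soft-threshold at lam / 2, which is the coordinate-wise minimizer.
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]      # add it back with the updated value
    return beta

def post_lasso(X, y, beta_hat):
    """Unpenalized least squares on the support selected by the first step."""
    support = np.flatnonzero(beta_hat != 0)
    beta = np.zeros(X.shape[1])
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta

# Hypothetical sparse model: 3 relevant regressors out of 20.
rng = np.random.default_rng(2)
n, p, sigma = 100, 20, 1.0
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [3.0, -2.0, 1.5]
y = X @ beta0 + sigma * rng.standard_normal(n)

lam = 2.2 * sigma * np.sqrt(2.0 * n * np.log(p))   # illustrative penalty level
beta_lasso = lasso_cd(X, y, lam)
beta_post = post_lasso(X, y, beta_lasso)
print(q_hat(X, y, beta_lasso), q_hat(X, y, beta_post))
```

By construction the refit never fits worse in sample than the LASSO solution, since the LASSO solution is itself supported on the selected model; this is the inequality $\hat Q(\tilde\beta) \le \hat Q(\hat\beta)$ used in the proof of Theorem 1.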

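Theorem 6 (Performance of post-LASSO). In either the parametric model or the nonparametric
model, if $\lambda \ge cn\|S\|_\infty$, for any $\epsilon > 0$ there is a constant $K_\epsilon$ independent of $n$ such that with
probability at least $1 - \epsilon$

$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\epsilon\sigma\sqrt{\frac{\hat m\log p + (\hat m + s)\log\mu_{\hat m}}{n}} + 2c_s + 1\{T \not\subseteq \hat T\}\sqrt{\frac{\lambda\sqrt{s}}{n\kappa_1}\left(\frac{(1 + c)\lambda\sqrt{s}}{cn\kappa_1} + 2c_s\right)},$$

where $\hat m := |\hat T \setminus T|$ and $c_s = 0$ in the parametric model. In particular, under the data-driven
choice of $\lambda$ specified in (2.3) with $\log(1/\alpha) \lesssim \log p$, for any $\epsilon > 0$ there is a constant $K'_{\epsilon,\alpha}$ such
that

$$\|\tilde\beta - \beta_0\|_{2,n} \le K'_{\epsilon,\alpha}\sigma\left[\sqrt{\frac{\hat m\log(p\mu_{\hat m})}{n}} + \sqrt{\frac{s\log\mu_{\hat m}}{n}} + 1\{T \not\subseteq \hat T\}\sqrt{\frac{s\log p}{n}}\frac{1}{\kappa_1}\right] \tag{5.27}$$

with probability at least $1 - \alpha - \epsilon$.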

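Proof of Theorem 6. Note that by the optimality of $\hat\beta$ in the LASSO problem, and letting
$\delta := \hat\beta - \beta_0$,

$$B_n := \hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}\big(\|\beta_0\|_1 - \|\hat\beta\|_1\big) \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big). \tag{5.28}$$

If $\|\delta_{T^c}\|_1 > c\|\delta_T\|_1$, we have $\hat Q(\hat\beta) - \hat Q(\beta_0) \le 0$ since $c > 1$. Otherwise, if $\|\delta_{T^c}\|_1 \le c\|\delta_T\|_1$,
by RE.1($c$) we have

$$B_n := \hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}\|\delta_T\|_1 \le \frac{\lambda\sqrt{s}}{n}\frac{\|\delta\|_{2,n}}{\kappa_1}. \tag{5.29}$$

The result follows by applying Lemma 2 to bound $\|\delta\|_{2,n}$ and Theorem 1, and also noting that
if $T \subseteq \hat T$ we have $C_n = 0$ so that $B_n \wedge C_n \le 1\{T \not\subseteq \hat T\}B_n$.

The second claim is immediate from the first, using the condition that $c_s \le \sigma\sqrt{s/n}$, relation
(2.9), in the case of the nonparametric model. □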

This theorem provides a performance bound for post-LASSO as a function of 1) LASSO's
sparsity characterized by $\hat m$, 2) LASSO's rate of convergence, and 3) LASSO's model selection
ability. For common designs this bound implies that post-LASSO performs at least as well as
LASSO, but it can be strictly better in some cases, and has smaller regularization bias. We provide
further theoretical comparisons in what follows, and computational examples supporting
these comparisons appear in Section 6. It is also worth repeating here that performance bounds
in other norms of interest immediately follow by the triangle inequality and by definition of $\kappa$:

$$\sqrt{\mathbb{E}_n[(x_i'\tilde\beta - f_i)^2]} \le \|\tilde\beta - \beta_0\|_{2,n} + c_s \quad \text{and} \quad \|\tilde\beta - \beta_0\| \le \|\tilde\beta - \beta_0\|_{2,n}/\kappa(\hat m). \tag{5.30}$$

Comment 5.1 (Comparison of the performance of post-LASSO vs LASSO). In order to carry
out complete and formal comparisons between LASSO and post-LASSO, we assume that

$$\phi(\hat m) \lesssim_P 1, \quad \kappa_1 \gtrsim_P 1, \quad \mu_{\hat m} \lesssim_P 1, \quad \log(1/\alpha) \lesssim \log p \quad \text{and} \quad \alpha = o(1). \tag{5.31}$$

We established fairly general sufficient conditions for the first three relations in Lemmas 9 and
10. The fourth relation is a mild condition on the choice of $\alpha$ in the definition of the data-driven
choice (2.3) of penalty level $\lambda$, which simplifies the probability statements in what follows. We
first note that under (5.31) post-LASSO with the data-driven penalty level $\lambda$ specified in (2.3)
obeys:

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\left[\sqrt{\frac{\hat m\log p}{n}} + \sqrt{\frac{s}{n}} + 1\{T \not\subseteq \hat T\}\sqrt{\frac{s\log p}{n}}\right].$$

In addition, conditions (5.31) and Theorem 5 imply the oracle sparsity $\hat m \lesssim_P s$.

It follows that post-LASSO generally achieves the same near-oracle rate as LASSO:

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s\log p}{n}}. \tag{5.32}$$

Notably, this occurs despite the fact that LASSO may in general fail to correctly select the
oracle model $T$ as a subset, that is $T \not\subseteq \hat T$.

Furthermore, there is a class of well-behaved models - a neighborhood of parametric models
with well-separated coefficients - in which post-LASSO strictly improves upon LASSO. Specifically,
if $\hat m = o_P(s)$ and $T \subseteq \hat T$ wp $\to 1$, as under conditions of Lemmas 3 and 4, then

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\left[\sqrt{\frac{o(s)\log p}{n}} + \sqrt{\frac{s}{n}}\right]. \tag{5.33}$$

That is, post-LASSO strictly improves upon LASSO's rate. Finally, in the extreme case of
perfect model selection, when $\hat m = 0$ and $T \subseteq \hat T$ wp $\to 1$, as under conditions of Lemma 4,
post-LASSO naturally achieves the oracle performance: $\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$. □

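5.3. Performance of the post-goof-trimmed LASSO estimator. In what follows we provide
performance bounds for the post-goof-trimmed estimator $\tilde\beta$ defined in equation (4.20) for
the case where the first-step estimator $\hat\beta$ is LASSO.

Theorem 7 (Performance of post-goof-trimmed LASSO). In either the parametric model or
the nonparametric model, if $\lambda \ge cn\|S\|_\infty$, for any $\epsilon > 0$ there is a constant $K_\epsilon$ independent of
$n$ such that with probability at least $1 - \epsilon$

$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\epsilon\sigma\sqrt{\frac{\tilde m\log p + (\tilde m + s)\log\mu_{\tilde m}}{n}} + 2c_s + 1\{T \not\subseteq \tilde T\}\sqrt{\frac{\lambda\sqrt{s}}{n\kappa_1}\left(\frac{(1 + c)\lambda\sqrt{s}}{cn\kappa_1} + 2c_s\right)},$$

where $\tilde m := |\tilde T \setminus T|$ and $c_s = 0$ in the parametric case. Under the data-driven choice of $\lambda$
specified in (2.3) with $\log(1/\alpha) \lesssim \log p$, for any $\epsilon > 0$ there is a constant $K'_{\epsilon,\alpha}$ such that

$$\|\tilde\beta - \beta_0\|_{2,n} \le K'_{\epsilon,\alpha}\sigma\left[\sqrt{\frac{\tilde m\log(p\mu_{\tilde m})}{n}} + \sqrt{\frac{s\log\mu_{\tilde m}}{n}} + 1\{T \not\subseteq \tilde T\}\sqrt{\frac{s\log p}{n}}\frac{1}{\kappa_1}\right] \tag{5.34}$$

with probability at least $1 - \alpha - \epsilon$.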
Proof. The proof of the first claim follows the same steps as the proof of Theorem 6, invoking
Theorem 3 in the last step. The second claim follows immediately from the first, where we also
use the condition $c_s \le \sigma\sqrt{s/n}$ from (2.9) in the nonparametric model, in addition to the condition
$\gamma \le 0$ imposed in the construction of the estimator. □

This theorem provides a performance bound for post-goof-trimmed LASSO as a function of
1) its sparsity characterized by $\tilde m$, 2) LASSO's rate of convergence, and 3) the model selection
ability of the trimming scheme. Generally, this bound is at least as good as the bound for post-LASSO,
since the post-goof-trimmed LASSO trims as much as possible subject to maintaining
a certain goodness-of-fit. It is also appealing that this estimator determines the trimming level in a
completely data-driven fashion. Moreover, by construction the estimated model is sparser than
post-LASSO's model, which leads to the superior performance of post-goof-trimmed LASSO
over post-LASSO in some cases. We provide further theoretical comparisons below and
computational examples in Section 6.

Comment 5.2 (Comparison of the performance of post-goof-trimmed LASSO vs LASSO and
post-LASSO). In order to carry out complete and formal comparisons, we assume condition
(5.31) as before. Under these conditions, post-goof-trimmed LASSO obeys the following performance
bound:

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\left[\sqrt{\frac{\tilde m\log p}{n}} + \sqrt{\frac{s}{n}} + 1\{T \not\subseteq \tilde T\}\sqrt{\frac{s\log p}{n}}\right],$$

which is analogous to the bound for post-LASSO, since $\tilde m \le \hat m \lesssim_P s$ by conditions (5.31) and
Theorem 5. It follows that in general post-goof-trimmed LASSO matches the near oracle rate
of convergence of LASSO and post-LASSO:

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s\log p}{n}}. \tag{5.35}$$

Nonetheless, there is a class of models - a neighborhood of parametric models with well-separated
coefficients - for which improvement upon the rate of convergence of LASSO is
possible. Specifically, if $\tilde m = o_P(s)$ and $T \subseteq \tilde T$ wp $\to 1$ then we obtain the performance bound
(5.33), that is, post-goof-trimmed LASSO strictly improves upon LASSO's rate. Furthermore,
if $\tilde m = o_P(\hat m)$ and $T \subseteq \tilde T$ wp $\to 1$, post-goof-trimmed LASSO also outperforms post-LASSO:

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\left[\sqrt{\frac{o(\hat m)\log p}{n}} + \sqrt{\frac{s}{n}}\right].$$

Lastly, under conditions of Lemma 3 holding for $t = t_\gamma$, post-goof-trimmed LASSO achieves
the oracle performance, $\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$. □


5.4.  Performance  of  the  post-traditional-trimmed  LASSO  estimator.  Next  we  con- 
sider the  traditional  trimming  scheme  which  truncates  to  zero  all  components  below  a  set 
threshold  t.  This  is  arguably  the  most  used  trimming  scheme  in  the  literature.  To  state  the 
result,  recall  that  ft^-  =  ^jl{|^j|  >  t],  m  :=  \f\T\,  nit  :=  \f\f\  and  7,  :==  ||3f  -  Phn  where 
P  is  the  LASSO  estimator. 

Tiieorem  8  (Performance  of  post-traditional-trimmed  LASSO).  In  either  the  parametric  model 
or  the  nonparametric  model,  if  X  >  cn||5||oo>  for  any  e  >  0  there  is  a  constant  K^-  independent 
of  n  such  that  with  probabtUty  at  least  I  -  £  we  have 


Pohn  <  K,J^^^2iP±S^l±l}}2^  +  2C., 


'IlKi     \  C  CIlKi  J 


-l{T^T}\ht{KeCTGt  +  6c, 


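where $G_t = \sqrt{m_t \log(p\mu_{m_t})}/\sqrt{n}$ and $\gamma_t \le t\sqrt{\phi(m_t)m_t}$. Under the data-driven choice of $\lambda$ specified in (2.3) for $\log(1/\alpha) \lesssim \log p$, for any $\varepsilon > 0$ there is a constant $K'_{\varepsilon,\alpha}$ such that with probability at least $1 - \alpha - \varepsilon$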


||/3-/3o||2.n       <       /C,a 

+  l{T^f}     7* 


mlogippLf, 


S  logoff, 


n  V      n      Ki 


Proof. The proof of the first claim follows the same steps as the proof of Theorem 6, invoking Theorem 4 in the last step. The second claim follows from the first, where we also use the condition $c_s \lesssim \sigma\sqrt{s/n}$, relation (2.9), for the nonparametric model. □


This theorem provides a performance bound for post-traditional-trimmed LASSO as a function of 1) its sparsity characterized by $\tilde m$ and improvements in sparsity over LASSO characterized by $m_t$, 2) LASSO's rate of convergence, 3) the trimming threshold $t$ and the resulting goodness-of-fit loss $\gamma_t$ relative to LASSO induced by trimming, and 4) the model selection ability of the trimming scheme. Generally, this bound may be worse than the bound for LASSO, and this arises because the post-traditional-trimmed LASSO may potentially use too much trimming, resulting in large goodness-of-fit losses $\gamma_t$. We provide further theoretical comparisons below and computational examples in Section 6.

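Comment 5.3 (Comparison of the performance of post-traditional-trimmed LASSO vs LASSO and post-LASSO). In this discussion we also assume conditions (5.31) made in the previous formal comparisons. Under these conditions, post-traditional-trimmed LASSO obeys the bound:

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{\tilde m\log p}{n}} + \sigma\sqrt{\frac{s}{n}} + 1\{T \not\subseteq \tilde T\}\left(\gamma_t \vee \sigma\sqrt{\frac{s\log p}{n}}\right). \qquad (5.36)$$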

In this case we have $\tilde m \vee m_t \le s + \hat m \lesssim_P s$ by Theorem 5, and, in general, the rate above cannot improve upon LASSO's rate of convergence given in Lemma 2.

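As expected, the choice of $t$, which controls $\gamma_t$ via the bound $\gamma_t \le t\sqrt{\phi(m_t)m_t}$, can have a large impact on the performance bounds:

$$t \lesssim \sigma\sqrt{\frac{\log p}{n}} \;\Longrightarrow\; \|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s\log p}{n}}, \qquad (5.37)$$

$$t \lesssim \sigma\sqrt{\frac{s\log p}{n}} \;\Longrightarrow\; \|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s^2\log p}{n}}.$$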

Both options are standard suggestions in the literature on model selection via LASSO, as we reviewed in Lemma 3 parts (2) and (3). The first choice (5.37), suggested by [11], is theoretically sound, since it guarantees that post-traditional-trimmed LASSO achieves the near-oracle rate of LASSO. The second choice, however, results in a very poor performance bound, and even suggests inconsistency if $s^2$ is large relative to $n$. Note that to implement the first choice (5.37) in practice we can set $t = \lambda/n$.
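The trimming-and-refitting step described above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the experiments: the function name `traditional_trim_refit`, the toy data, the first-step estimate `beta_lasso` and the penalty level `lam` are all hypothetical and chosen only to show the mechanics of setting $t = \lambda/n$ and refitting by ordinary least squares.

```python
import numpy as np

def traditional_trim_refit(X, y, beta_lasso, t):
    """Hard-threshold a first-step estimate at level t, then refit by OLS."""
    beta_lasso = np.asarray(beta_lasso, dtype=float)
    # Keep only the components whose magnitude exceeds the threshold t.
    support = np.flatnonzero(np.abs(beta_lasso) > t)
    beta_post = np.zeros_like(beta_lasso)
    if support.size > 0:
        # Unpenalized least squares restricted to the retained columns.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_post[support] = coef
    return beta_post, support

# Toy illustration (all numbers below are made up, not taken from the paper).
rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta0                      # noiseless, so OLS on the true support is exact
beta_lasso = np.array([0.8, -1.7, 0.05, 0.0, 0.0, -0.02, 0.0, 0.0])  # hypothetical shrunken fit
lam = 10.0                         # hypothetical penalty level
beta_post, support = traditional_trim_refit(X, y, beta_lasso, t=lam / n)
```

In this toy case the threshold $t = 0.2$ removes the two small spurious coefficients, and the refit on the remaining columns undoes the shrinkage of the first-step estimate.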

Furthermore, there is a special class of models - a neighborhood of parametric models with well-separated coefficients - for which an improvement upon the rate of convergence of LASSO is possible. Specifically, if $\tilde m = o_P(s)$ and $T \subseteq \tilde T$ wp $\to 1$ then we recover the performance bound (5.33), that is, post-traditional-trimmed LASSO strictly improves upon LASSO's rate. Furthermore, if $\tilde m = o_P(\hat m)$ and $T \subseteq \tilde T$ wp $\to 1$, post-traditional-trimmed LASSO also outperforms post-LASSO:

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{o(\hat m)\log p}{n}} + \sigma\sqrt{\frac{s}{n}}.$$

Lastly, under the conditions of Lemma 3 holding for the given $t$, post-traditional-trimmed LASSO achieves the oracle performance, $\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$. □
6. Empirical Performance Relative to LASSO

In this section we assess the finite sample performance of the following estimators: 1) LASSO, which is our benchmark, 2) post-LASSO, 3) post-goof-trimmed LASSO, and 4) post-traditional-trimmed LASSO with the trimming threshold $t = \lambda/n$ suggested by Lemma 3 part (3). We consider a "parametric" and a "nonparametric" model of the form:

$$y_i = f_i + \epsilon_i, \quad f_i = x_i'\beta_0, \quad \epsilon_i \sim N(0,\sigma^2), \quad i = 1,\dots,n,$$

where in the parametric model

$$\beta_0 = C\cdot[1,1,1,1,1,0,0,\dots,0]', \qquad (6.39)$$

and in the nonparametric model

$$\beta_0 = C\cdot[1,1/2,1/3,\dots,1/p]'. \qquad (6.40)$$

The parameter $C$ determines the size of the coefficients, representing the "strength of the signal", and we vary $C$ between 0 and 2. The number of regressors is $p = 500$, the sample size is $n = 100$, the variance of the noise is $\sigma^2 = 1$, and we used 1000 simulations for each design. We generate regressors from the normal law $x_i \sim N(0,\Sigma)$, and consider three designs of the covariance matrix $\Sigma$: a) the isotropic design with $\Sigma_{jk} = 0$ for $j \ne k$, b) the Toeplitz design with $\Sigma_{jk} = (1/2)^{|j-k|}$, and c) the equi-correlated design with $\Sigma_{jk} = 1/2$ for $j \ne k$; in all designs $\Sigma_{jj} = 1$. Thus our parametric model is very sparse and offers a rather favorable setting for applying LASSO-type methods, while our nonparametric model is non-sparse and much less favorable.
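The simulation design just described can be written down compactly. The following is a minimal sketch under the stated settings; the function names `make_sigma`, `make_beta0` and `simulate` are hypothetical, and any normalization of the regressors used in the actual experiments is not reproduced here.

```python
import numpy as np

def make_sigma(p, design, rho=0.5):
    """Covariance matrix for the three designs; the diagonal is always 1."""
    idx = np.arange(p)
    if design == "isotropic":
        return np.eye(p)
    if design == "toeplitz":
        return rho ** np.abs(idx[:, None] - idx[None, :])
    if design == "equicorrelated":
        return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    raise ValueError(f"unknown design: {design}")

def make_beta0(p, C, model):
    """Coefficient vectors of the form (6.39) and (6.40)."""
    if model == "parametric":
        beta0 = np.zeros(p)
        beta0[:5] = 1.0
        return C * beta0
    if model == "nonparametric":
        return C / np.arange(1, p + 1)
    raise ValueError(f"unknown model: {model}")

def simulate(n, p, C, design, model, sigma=1.0, rng=None):
    """Draw one sample (X, y) together with the regression function f."""
    rng = np.random.default_rng() if rng is None else rng
    chol = np.linalg.cholesky(make_sigma(p, design))
    X = rng.standard_normal((n, p)) @ chol.T   # rows are N(0, Sigma)
    beta0 = make_beta0(p, C, model)
    f = X @ beta0
    y = f + sigma * rng.standard_normal(n)
    return X, y, f, beta0

# One draw from the Toeplitz design with a parametric signal of strength C = 1.
X, y, f, beta0 = simulate(n=100, p=500, C=1.0, design="toeplitz",
                          model="parametric", rng=np.random.default_rng(0))
```

Repeating the call to `simulate` 1000 times for each value of $C$ on a grid in $[0, 2]$ would mirror the structure of the experiments, though the seed and grid are left to the reader.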

We present the results of computational experiments for each design a)-c) in Figures 2-4. The left column of each figure reports the results for the parametric model, and the right column of each figure reports the results for the nonparametric model. For each model the figures plot the following as a function of the signal strength for each estimator $\tilde\beta$:

•  in the top panel, the number of regressors selected, $|\tilde T|$,

•  in the middle panel, the norm of the bias, namely $\|E[\tilde\beta - \beta_0]\|$,

•  in the bottom panel, the average empirical risk, namely $E[E_n[f_i - x_i'\tilde\beta]^2]$.
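The three reported quantities can be computed from Monte Carlo output as sketched below. This is an illustrative helper only: the name `summarize` and the array layout (one row of `betas` and one design matrix in `Xs` per replication) are assumptions, and the selected set is taken to be the set of nonzero coefficients of each estimate.

```python
import numpy as np

def summarize(betas, Xs, beta0):
    """Monte Carlo summaries for R replications.

    betas : (R, p) array of estimates, one row per replication
    Xs    : (R, n, p) array of design matrices, one per replication
    beta0 : (p,) true coefficient vector, so that f_i = x_i' beta0
    """
    betas = np.asarray(betas, dtype=float)
    Xs = np.asarray(Xs, dtype=float)
    beta0 = np.asarray(beta0, dtype=float)
    # Top panel: average number of selected (nonzero) regressors.
    n_selected = np.mean(np.count_nonzero(betas, axis=1))
    # Middle panel: norm of the bias, || E[beta - beta0] ||.
    bias_norm = np.linalg.norm(betas.mean(axis=0) - beta0)
    # Bottom panel: E[ E_n[(f_i - x_i' beta)^2] ], using f_i - x_i' beta = x_i'(beta0 - beta).
    errors = np.einsum("rnp,rp->rn", Xs, beta0[None, :] - betas)
    risk = np.mean(np.mean(errors ** 2, axis=1))
    return n_selected, bias_norm, risk

# Tiny deterministic example with two replications.
Xs = np.stack([np.eye(2), np.eye(2)])
beta0 = np.array([1.0, 0.0])
betas = np.array([[1.0, 0.0], [0.0, 0.0]])
n_selected, bias_norm, risk = summarize(betas, Xs, beta0)
```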

We  will  focus  the  discussion  on  the  isotropic  design,  and  only  highlight  differences  for  other 
designs. 

Figure 2, left column, shows the results for the parametric model with the isotropic design. We see from the bottom panel that, for a wide range of signal strength $C$, both post-LASSO and post-goof-trimmed LASSO significantly outperform both LASSO and post-traditional-trimmed LASSO in terms of empirical risk. The middle panel shows that the first two estimators' superior performance stems from their much smaller bias. We see from the top panel that LASSO achieves good sparsity, ensuring that post-LASSO performs well, but post-goof-trimmed LASSO achieves even better sparsity. Under very high signal strength, post-goof-trimmed LASSO achieves the performance of the oracle estimator; post-traditional-trimmed LASSO also achieves this performance; post-LASSO nearly matches it; while LASSO does not match this performance. Interestingly, the post-traditional-trimmed LASSO performs very poorly for intermediate ranges of signal.

Figure 2, right column, shows the results for the nonparametric model with the isotropic design. We see from the bottom panel that, as in the parametric model, both post-LASSO and post-goof-trimmed LASSO significantly outperform both LASSO and post-traditional-trimmed LASSO in terms of empirical risk. As in the parametric model, the middle panel shows that the first two estimators are able to outperform the last two because they have a much smaller bias. We also see from the top panel that, as in the parametric model, LASSO achieves good sparsity, while post-goof-trimmed LASSO achieves excellent sparsity. In contrast to the parametric model, in the nonparametric setting the post-traditional-trimmed LASSO performs poorly in terms of empirical risk for almost all signals, except for very weak signals. Also in contrast to the parametric model, no estimator achieves the exact oracle performance, although LASSO, and especially post-LASSO and post-goof-trimmed LASSO, perform nearly as well, as we would expect from the theoretical results.

Figure  3  shows  the  results  for  the  parametric  and  nonparametric  model  with  the  Toeplitz 
design.  This  design  deviates  only  moderately  from  the  isotropic  design,  and  we  see  that  all  of 
the  previous  findings  continue  to  hold.  Figure  4  shows  the  results  under  the  equi-correlated 
design.  This  design  strongly  deviates  from  the  isotropic  design,  but  we  still  see  that  the  previous 
findings  continue  to  hold  with  only  a  few  differences.  Specifically,  we  see  from  the  top  panels 
that  in  this  case  LASSO  no  longer  selects  very  sparse  models,  while  post-goof-trimmed  LASSO 
continues  to  perform  well  and  selects  very  sparse  models.  Consequently,  in  the  case  of  the 
parametric  model,  post-goof-trimmed  LASSO  substantially  outperforms  post-LASSO  in  terms 
of empirical risk, as the bottom-left panel shows. In contrast, we see from the bottom-right panel that in the nonparametric model, post-goof-trimmed LASSO performs as well as post-LASSO in terms of empirical risk, despite the fact that it uses a much sparser model for estimation.
The findings above confirm our theoretical results on post-penalized estimators in parametric and nonparametric models. Indeed, we see that post-goof-trimmed LASSO and post-LASSO are at least as good as LASSO, and often perform considerably better since they remove penalization bias. Post-goof-trimmed LASSO outperforms post-LASSO whenever LASSO does not produce excellent sparsity. Moreover, when the signal is strong and the model is parametric and sparse (or very close to being such), the LASSO-based model selection permits the selection of an oracle or near-oracle model. That allows post-model selection estimators to achieve improvements in empirical risk over LASSO. Of particular note is the excellent performance of post-goof-trimmed LASSO, which uses data-driven trimming to select a sparse model. This performance is fully consistent with our theoretical results. Finally, traditional trimming performs poorly for intermediate ranges of signal. In particular, it exhibits very large biases leading to large goodness-of-fit losses.
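To make the two-step construction concrete, the following minimal sketch computes a LASSO fit and the corresponding post-LASSO refit. It is not the solver used for the experiments: the first step is a plain proximal-gradient iteration for the criterion $E_n[(y_i - x_i'\beta)^2] + (\lambda/n)\|\beta\|_1$, and the names `lasso_ista` and `post_lasso`, the toy data and the penalty level `lam` are illustrative assumptions (in particular, `lam` is not the data-driven rule (2.3)).

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1/n)||y - X b||^2 + (lam/n)||b||_1 by proximal gradient steps."""
    n, p = X.shape
    # Lipschitz constant of the gradient of the quadratic part.
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ b) / n
        z = b - grad / lipschitz
        # Soft-thresholding is the proximal map of the l1 penalty.
        b = np.sign(z) * np.maximum(np.abs(z) - lam / (n * lipschitz), 0.0)
    return b

def post_lasso(X, y, b_lasso):
    """Ordinary least squares on the support selected by the first step."""
    support = np.flatnonzero(b_lasso)
    b = np.zeros_like(b_lasso)
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        b[support] = coef
    return b

# Toy data (illustrative only).
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = 2.0
y = X @ beta0 + rng.standard_normal(n)
lam = 2.0 * np.sqrt(2.0 * n * np.log(p))   # hypothetical penalty level
b_lasso = lasso_ista(X, y, lam)
b_post = post_lasso(X, y, b_lasso)
```

By construction the refit uses only regressors selected in the first step, and its in-sample residual sum of squares cannot exceed that of the LASSO fit, which is the mechanical counterpart of the removal of penalization bias discussed above.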

Appendix  A.  Proofs  of  Lemmas  1  and  2 

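Proof of Lemma 1. Following Bickel, Ritov and Tsybakov [2], to establish the result we make use of the following relations for $\delta = \hat\beta - \beta_0$ and for $\lambda \ge cn\|S\|_\infty$:

$$\hat Q(\hat\beta) - \hat Q(\beta_0) \ge -\|S\|_\infty\|\delta\|_1 + \|\delta\|_{2,n}^2 \ge -\frac{\lambda}{cn}\left(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\right) + \|\delta\|_{2,n}^2, \qquad (A.41)$$

$$\|\beta_0\|_1 - \|\hat\beta\|_1 = \|\beta_{0T}\|_1 - \|\hat\beta_T\|_1 - \|\hat\beta_{T^c}\|_1 \le \|\delta_T\|_1 - \|\delta_{T^c}\|_1. \qquad (A.42)$$

By definition of $\hat\beta$, $\hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\hat\beta\|_1$, which, by (A.41) and (A.42), implies that

$$-\frac{\lambda}{cn}\left(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\right) + \|\delta\|_{2,n}^2 \le \frac{\lambda}{n}\left(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\right). \qquad (A.43)$$

Since $\|\delta\|_{2,n}^2 \ge 0$,

$$\|\delta_{T^c}\|_1 \le \frac{c+1}{c-1}\,\|\delta_T\|_1 = \bar c\,\|\delta_T\|_1. \qquad (A.44)$$

Going back to (A.43), we get that

$$\|\delta\|_{2,n}^2 \le \left(1 + \frac{1}{c}\right)\frac{\lambda}{n}\|\delta_T\|_1 \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n}\,\frac{\|\delta\|_{2,n}}{\kappa_1},$$

where we used that $c > 1$ and invoked RE.1($\bar c$) since (A.44) holds. Solve for $\|\delta\|_{2,n}$.

Finally, the bound on $\Lambda(1 - \alpha|X)$ follows from the union bound and a probability inequality for Gaussian random variables, $P(|\xi| > M) \le \exp(-M^2/2)$ if $\xi \sim N(0,1)$, see Proposition 2.2.1(a) in [8]. □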
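The last step of the proof combines a Gaussian tail inequality with the union bound over $p$ coordinates. The sketch below checks both numerically; the function names are hypothetical, and the coordinates are assumed to be normalized to standard normal, which is a simplification of the setting in the lemma.

```python
import math

def two_sided_gaussian_tail(M):
    """Exact P(|xi| > M) for xi ~ N(0, 1), via the complementary error function."""
    return math.erfc(M / math.sqrt(2.0))

def union_bound_level(p, alpha):
    """The value M solving p * exp(-M^2 / 2) = alpha."""
    return math.sqrt(2.0 * math.log(p / alpha))

# The tail inequality P(|xi| > M) <= exp(-M^2 / 2) at a few points.
tail_checks = [(M, two_sided_gaussian_tail(M), math.exp(-M * M / 2.0))
               for M in (0.5, 1.0, 2.0, 3.0)]

# Union bound over p coordinates: P(max_j |xi_j| > M) <= p * P(|xi| > M) <= alpha.
p, alpha = 500, 0.05
M = union_bound_level(p, alpha)
bound = p * two_sided_gaussian_tail(M)
```

Here `bound` is the union-bound estimate of the exceedance probability at the level `M`, and it stays below `alpha` because each individual tail is dominated by $\exp(-M^2/2)$.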
Figure 2. This figure plots the performance of the estimators listed in the text under the isotropic design for the covariates, $\Sigma_{jk} = 0$ if $j \ne k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$ with 1000 simulations for each value of $C$.

Figure 3. This figure plots the performance of the estimators listed in the text under the Toeplitz design for the covariates, $\Sigma_{jk} = \rho^{|j-k|}$ if $j \ne k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$ with 1000 simulations for each value of $C$.

Figure 4. This figure plots the performance of the estimators listed in the text under the equi-correlated design for the covariates, $\Sigma_{jk} = \rho$ if $j \ne k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$ with 1000 simulations for each value of $C$.
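Proof of Lemma 2. Similar to [2], to prove Lemma 2 we make use of the following relation: for $\delta = \hat\beta - \beta_0$, if $\lambda \ge cn\|S\|_\infty$,

$$\hat Q(\hat\beta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2 = -2E_n[\epsilon_i x_i'\delta] - 2E_n[r_i x_i'\delta] \ge -\|S\|_\infty\|\delta\|_1 - 2c_s\|\delta\|_{2,n}$$

$$\ge -\frac{\lambda}{cn}\left(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\right) - 2c_s\|\delta\|_{2,n}. \qquad (A.45)$$

By definition of $\hat\beta$, $\hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\hat\beta\|_1$, which implies that

$$-\frac{\lambda}{cn}\left(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\right) + \|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \le \frac{\lambda}{n}\left(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\right). \qquad (A.46)$$

If $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} < 0$, then we have established the bound in the statement of the lemma. On the other hand, if $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \ge 0$ we get

$$\|\delta_{T^c}\|_1 \le \frac{c+1}{c-1}\,\|\delta_T\|_1 = \bar c\,\|\delta_T\|_1, \qquad (A.47)$$

and therefore $\delta$ satisfies the domination condition (2.6). From (A.46) and using RE.1($\bar c$) we get

$$\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \le \left(1 + \frac{1}{c}\right)\frac{\lambda}{n}\|\delta_T\|_1 \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n}\,\frac{\|\delta\|_{2,n}}{\kappa_1},$$

which gives the result on the prediction norm. Finally, the bound on $\Lambda(1 - \alpha|X)$ follows from the union bound and a probability inequality for Gaussian random variables, $P(|\xi| > M) \le \exp(-M^2/2)$ if $\xi \sim N(0,1)$, see Proposition 2.2.1(a) in [8]. □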

Appendix  B.  Proofs  of  Lemmas  for  Post-Model  Selection  Estimators 

Proof of Lemma 5. The bound on $B_n$ follows from Lemma 6 result (1). The bound on $C_n$ follows from Lemma 6 result (2). □

Proof of Lemma 6. Result (1) follows from the relation

$$\left|\hat Q(\beta_0 + \delta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2\right| = \left|2E_n[\epsilon_i x_i'\delta] + 2E_n[r_i x_i'\delta]\right|,$$

then applying Theorem 2 on sparse control of noise to $|2E_n[\epsilon_i x_i'\delta]|$, bounding $|2E_n[r_i x_i'\delta]|$ by $2c_s\|\delta\|_{2,n}$ using the Cauchy-Schwarz inequality, and bounding $\binom{p}{m}$ by $p^m$.

Result (2) also follows from Theorem 2 but applying it with $s = 0$, $p = s$ (since only the components in $T$ are modified), $m = k$, and noting that we can take $\mu_m$ with $m = 0$. □
Appendix C. Proofs of Lemmas for Sparsity of the LASSO estimator

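Proof of Lemma 7. Let $\hat T = \mathrm{support}(\hat\beta)$, and $\hat m = |\hat T \setminus T|$. We have from the optimality conditions that

$$2E_n[x_{ij}(y_i - x_i'\hat\beta)] = \mathrm{sign}(\hat\beta_j)\lambda/n \quad \text{for each } j \in \hat T \setminus T.$$

Therefore we have for $R = (r_1,\dots,r_n)'$

$$\sqrt{\hat m}\,\lambda = 2\|(X'(Y - X\hat\beta))_{\hat T\setminus T}\|$$

$$\le 2\|(X'(Y - R - X\beta_0))_{\hat T\setminus T}\| + 2\|(X'R)_{\hat T\setminus T}\| + 2\|(X'X(\beta_0 - \hat\beta))_{\hat T\setminus T}\|$$

$$\le \sqrt{\hat m}\cdot n\|S\|_\infty + 2n\sqrt{\phi(\hat m)}\,c_s + 2n\sqrt{\phi(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n},$$

where we used that

$$\|(X'X(\beta_0 - \hat\beta))_{\hat T\setminus T}\| = \sup_{\|\alpha_{T^c}\|_0 \le \hat m,\ \|\alpha\| \le 1} |\alpha'X'X(\beta_0 - \hat\beta)|$$

$$\le \sup_{\|\alpha_{T^c}\|_0 \le \hat m,\ \|\alpha\| \le 1} \|\alpha'X'\|\,\|X(\beta_0 - \hat\beta)\|$$

$$= \sup_{\|\alpha_{T^c}\|_0 \le \hat m,\ \|\alpha\| \le 1} \sqrt{\alpha'X'X\alpha}\,\|X(\beta_0 - \hat\beta)\|$$

$$\le n\sqrt{\phi(\hat m)}\,\|\beta_0 - \hat\beta\|_{2,n},$$

and similarly $\|(X'R)_{\hat T\setminus T}\| \le n\sqrt{\phi(\hat m)}\,c_s$.

Since $\lambda/c \ge n\|S\|_\infty$, and by Lemma 2, $\|\beta_0 - \hat\beta\|_{2,n} \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n\kappa_1} + 2c_s$, we have

$$(1 - 1/c)\sqrt{\hat m} \le 2\sqrt{\phi(\hat m)}\,(1 + 1/c)\sqrt{s}/\kappa_1 + 6\sqrt{\phi(\hat m)}\,nc_s/\lambda.$$

The result follows by noting that $(1 - 1/c) = 2/(\bar c + 1)$ by definition of $\bar c$. □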

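Proof of Lemma 8. Let $W := E_n[x_i x_i']$ and $\bar\alpha$ be such that $\phi(\lceil \ell k\rceil) = \bar\alpha'W\bar\alpha$. We can decompose

$$\bar\alpha = \sum_{i=1}^{\lceil\ell\rceil}\alpha_i, \quad \text{with } \sum_{i=1}^{\lceil\ell\rceil}\|\alpha_{iT^c}\|_0 = \|\bar\alpha_{T^c}\|_0 \text{ and } \alpha_{iT} = \bar\alpha_T/\lceil\ell\rceil,$$

where we can choose the $\alpha_i$'s such that $\|\alpha_{iT^c}\|_0 \le k$ for each $i = 1,\dots,\lceil\ell\rceil$, since $\lceil\ell\rceil k \ge \lceil\ell k\rceil$. Since $W$ is positive semi-definite, $\alpha_i'W\alpha_i + \alpha_j'W\alpha_j \ge 2|\alpha_i'W\alpha_j|$ for any pair $(i,j)$. Therefore

$$\phi(\lceil\ell k\rceil) = \bar\alpha'W\bar\alpha = \sum_{i=1}^{\lceil\ell\rceil}\sum_{j=1}^{\lceil\ell\rceil}\alpha_i'W\alpha_j \le \sum_{i=1}^{\lceil\ell\rceil}\sum_{j=1}^{\lceil\ell\rceil}\frac{\alpha_i'W\alpha_i + \alpha_j'W\alpha_j}{2} = \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\alpha_i'W\alpha_i$$

$$\le \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2\phi(\|\alpha_{iT^c}\|_0) \le \lceil\ell\rceil\max_{i=1,\dots,\lceil\ell\rceil}\phi(\|\alpha_{iT^c}\|_0) \le \lceil\ell\rceil\phi(k),$$

where we used that

$$\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2 = \sum_{i=1}^{\lceil\ell\rceil}\left(\|\alpha_{iT^c}\|^2 + \|\alpha_{iT}\|^2\right) \le \|\bar\alpha_{T^c}\|^2 + \|\bar\alpha_T\|^2 = \|\bar\alpha\|^2 = 1.$$

□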
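The sublinearity property $\phi(\lceil\ell k\rceil) \le \lceil\ell\rceil\phi(k)$ established above can be checked by brute force on a small example. The sketch below is only a numerical illustration under a simplifying assumption: it takes $T = \emptyset$, so that $\phi(m)$ reduces to the largest eigenvalue over all $m \times m$ principal submatrices of $W$; the function name `max_sparse_eigenvalue` and the toy dimensions are hypothetical.

```python
import itertools
import math
import numpy as np

def max_sparse_eigenvalue(W, m):
    """Largest eigenvalue over all m-by-m principal submatrices of W (brute force, m >= 1)."""
    p = W.shape[0]
    m = min(m, p)
    best = 0.0
    for idx in itertools.combinations(range(p), m):
        sub = W[np.ix_(idx, idx)]
        # eigvalsh returns eigenvalues in ascending order for symmetric input.
        best = max(best, float(np.linalg.eigvalsh(sub)[-1]))
    return best

# Small empirical Gram matrix W = E_n[x_i x_i'] (toy data, T taken to be empty).
rng = np.random.default_rng(0)
n, p = 30, 8
X = rng.standard_normal((n, p))
W = X.T @ X / n

k, ell = 2, 2.5
lhs = max_sparse_eigenvalue(W, math.ceil(ell * k))   # phi(ceil(ell * k)) = phi(5)
rhs = math.ceil(ell) * max_sparse_eigenvalue(W, k)   # ceil(ell) * phi(k) = 3 * phi(2)
```

The enumeration is exponential in $p$, so this check is only feasible for very small problems; it is meant to illustrate the inequality, not to compute sparse eigenvalues in the designs of Section 6.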


Proof of Lemma 9. First note that $P(\max_{1\le j\le p}|\hat\sigma_j - 1| \le 1/4) \to 1$ as $n$ grows under the side condition on $n$. Let $c_*(m)$ and $c^*(m)$ denote the minimum and maximum $m$-sparse eigenvalues associated with $E_n[\tilde x_i\tilde x_i']$ (unnormalized covariates). It follows that $\phi(m) \le \max_{1\le j\le p}\hat\sigma_j^2\,c^*(m+s)$ and $\kappa(m)^2 \ge \min_{1\le j\le p}\hat\sigma_j^2\,c_*(m+s)$. Thus, the bound on $\phi(m)$ and $\kappa(m)^2$ follows from [20]'s proof of Proposition 2 (i) with $\epsilon_1 = 1/3$, $\epsilon_2 = 1/2$, and $\epsilon_3 = \epsilon_4 = 1/16$, which bounds the deviation of $c_*(m+s)$ and $c^*(m+s)$ from their population counterparts. The bound on the restricted eigenvalue $\kappa_1$ follows from Lemma 3 (ii) in [2]. Let $M = (s\log(n/e)) \wedge (n/[16\log p])$ so that as $n$ grows $M/s \to \infty$ under the side condition on $s$, and we have $M \in \mathcal{M}$ for $n$ sufficiently large since $\kappa_1$ is bounded from below and $\phi(M)$ is bounded from above with probability going to one. Thus, the bound on $\hat m$ then follows from Theorem 5 if $\lambda \ge cn\|S\|_\infty$, which occurs with probability at least $1 - \alpha$. □

Proof of Lemma 10. First note that $P(\max_{1\le j\le p}|\hat\sigma_j - 1| \le 1/4) \to 1$ as $n$ grows under the side condition on $n$. Let $c_*(m)$ and $c^*(m)$ denote the minimum and maximum $m$-sparse eigenvalues associated with $E_n[\tilde x_i\tilde x_i']$ (unnormalized covariates). It follows that $\phi(m) \le \max_{1\le j\le p}\hat\sigma_j^2\,c^*(m+s)$ and $\kappa(m)^2 \ge \min_{1\le j\le p}\hat\sigma_j^2\,c_*(m+s)$. Thus, the bound on $\phi(m)$ and $\kappa(m)^2$ follows from [20]'s proof of Proposition 2 (ii) with $\tau_* = 1/2$ and $\tau^* = 2$, which bounds the deviation of $c_*(m+s)$ and $c^*(m+s)$ from their population counterparts. The bound on the restricted
eigenvalue  ki  follows  from  Lemma  3  (ii)  in  [2]  and  the  side  condition  on  s.  Next  let  M  — 
(slog(n/e))  A  ([e/A'sJ^/n/logp)  so  that  as  n  grows  M/s  — >  00,  under  the  side  condition  on 
$s$, and we have $M \in \mathcal{M}$ for $n$ sufficiently large since $\kappa_1$ is bounded from below and $\phi(M)$ is bounded from above with probability going to one. Thus, the bound on $\hat m$ then follows from Theorem 5 if $\lambda \ge cn\|S\|_\infty$, which occurs with probability at least $1 - \alpha$. □